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Metric Spaces 
Description of signal spaces and metric spaces. 


Signal Spaces 


We start the content of our course by defining its main concepts of a signal 
and a signal space. 


Definition 1 A signal is the value of some quantity as a function of time, 
space, frequency, etc.; each signal is labeled by a lower-case letter zx. 


Definition 2 A signal space is a set of signals defined by some criterion, 
labeled by an upper-case letter X (since it is a set). 


Some familiar sets of signals are_X = R, X = C, and the set of vectors 
X — R”. 


Definition 3 The signal space L [a, b contains all signals x(t) such that 
z(t) = 0 forallt <aort > band [’ |x (t)|?dt < 00 (ie., at no time the 
signal is infinite). 


Metric Spaces 


Definition 4 A metric d: X x X — Risa function used to measure 
distance between pairs of elements of X with the following properties: for 
all zw, y,2z © X, 


d(x,y) = d(y, x) (symmetry), 
ae. >0 Gaon: -negativity), 


y) 
Oe; y) = 0 ae Y; 
. a(x, z) < d(x, y) + d(y, z) (triangle inequality). 


If d is a metric on X, the pair (X, d) is called a metric space. A set X can 
have multiple metrics, leading to different metric spaces. 


Example 1 The following are some initial examples of metric spaces: 


e X = R with do (z, y) = |x — y| for all x, y € R: it is easy to check 
properties (1-4). 


1 if 
+ X=Rwith d (cy) = | pak 


; for all z, y € R: it is easy to 
OQ. ir. aay 


check properties (1-3). To verify (4), assume that 
d(x, y) + d(y, z) = 0; then both d(z, y) = 0 and d(y, z) = 0, which 
means x = yand y = 2; by transitivity, = z and 
d(x, z) =0 < d(z, y) + d(y, z), as desired. Now assume that 
d(x, y) + d(y, z) = 1; then we immediately get 
d(x, z) < d(x, y) + d(y, z), as desired. 
° X =R" with metric d? (w,y) = (0, |ai — ys?) /”, known as 
the Euclidean metric. 
e X = R” with metric d? (z, y) := = ee ou 


1/2 
e X = L, [a,b] with metric dz (x =f |x (t) — y(t)| *at) 


Formally, dz is a cuienion on es , since ce are cane x # y that 
yield dz (x, y) = 0. However, one can define a new signal space 
where all signals with d(x, y) = 0 are equal to each other. 


1/p 
¢ X = Ly [a,b] with metric d, ( a |x (t — y(t)|Pat) , for 


1 < p < ow, an extension of the metric do. 
¢ X = Ly |a, bj with metric dx (x, y) :=suPzeja.) |e (t) — y(t)]; 
metric solves the equivalence problem of do. 


Here, sup (A) is the supremum of A, i.e., the smallest value xo € R such 
that x < xp V x € A. Similarly, inf (A) is the infimum of A, i-e., the 
largest value 2q such thatrp <a V@E A. 


Vector Spaces 
Description of vector spaces, subspaces, bases and spans. 


Vector Spaces 


Definition 1 A linear vector space (X, R, +, -) is given by a signal space X (called 
vectors), a set of scalars R, an addition operation + : X x X — X, and a multiplication 
operation -: R x X — X, such that: 


1. X forms a group under addition: 


aVa,yEeX Ala+yeX, (closed under addition) 

b.40¢€ X suchthatO+ X=X+0=X. (additive identity) 
cVxexX dyeXsuchthatzr+y=0, (additive inverse) 
d.Va,yzExX x+(y+z)=(e@+y)+2z. (associative law) 


2. Multiplication has the following properties: for any x, y € X anda,be€ R: 
a.a-x2€X, (closure in X under multiplication) 
b.a-(b- x2) =(a-b)-az, (compatibility) 
c(a+b)-x=a-x+6-2, (distributive law over R) 
d.a-(a@+y)=a-x+a-y. (distributive law over X) 

3. The set R has the following properties: 


a. There exists 1 € Rst.1-x2=2 Va2eEX, (multiplication identity) 
b. There existsO0 € Rst.0-x2=0 VaeX. (multiplicative null element) 


Example 1 Here are some examples of vector spaces: 


e X = R” (space of all vectors of length n) over R = R is a vector space. 

e X = C” (C is complex numbers) over R = C is a vector space. 

e X = R” over R = C is not a vector space, because closure in X under 
multiplication is not met. 

¢ X = C'|T] (continuous functions in T) over R = R is a vector space. 


Subspaces 


Definition 2 A subset M C X is a linear subspace of X if M itself is a linear vector 
space. Note that, in particular, this implies that any subspace M must obey 0 € M. 


Example 2 Here are some examples of subspaces: 


e In X = R? over R = R, any line that passes through the origin is a subspace of X: 


Equation: 
Mi = { (24) ER? such that 2 = cl. 
ne 


¢ In X = CT] over R = R, the followings are subspaces of X: 
Equation: 


M, =({f (x) =az*+br+c:a,b,ceE R} 
Mz ={f (x): f (zo) = Of. 


In contrast, the set M3, = {f (x) : f (vq) = a £ 0} is not a subspace. 
Proposition 1 If M/ and N are subspaces of X, then M ™ N is also a subspace. 


Proof: We assume that M and N hold properties of linear vector space, and show that so 
does MM N: 


1. Equation: 


zyeEM => «+yEeM 


MAN => 
ae eee => a«rt+yEeN 


} > «r+yeMnNn 


2. Equation: 


ie linear vector space > 0 € M 


=0EMNN 
N linear vector space > 0 € N } 


3. Equation: 


zEM sayeM st. «ct+y=0 


MaAN=> 
oo ee dyEN st. xct+ty=0 


souemnn 


The other properties are shown in a similar fashion. 


Definition 3 A vector x € X, where (X, R,+,-) is a vector space, is a linear 
combination of a set {11, £2,...,n} C X if it can be written as x = So, a; - Ui, 
a; € R. The set of all linear combinations of a set of points {21, £2, ...,2,} builds a 
linear subspace of X. 


Example 3 Q = )~7_, a;2’ is a linear subspace of (C[T], R, +, -) containing the set of 

all quadratic functions, as it corresponds to all linear combinations of the set of functions 
2 

{x es Tt: 


Bases and Spans 


Definition 4 For the set S = {21, £2, ...,£,} C X, the span of S is written as 
Equation: 
n 
[S] = span(S)=¢2:2= a;xj,a; © Rp. 


1=1 


Example 4 The space of quadratic functions Q is written as Q = [S;], with 

Sy = {x?,x,1}. The space can also be written as [S'2] with Sp = {1, x, x” — 2}), ie. 
[Si] = [2]. To prove this, we need to show [S2] C [S41] and [$1] C [$2]. For the former 
case we have 

Equation: 


x = a, + a2z + a3 (2? — 2) 


= (a; — 2a3) + age + age’, 
which means that every element that can be spanned by S», can also be spanned by S}, 
and hence [2] C [S$]. The latter case can be shown in a similar manner. 


Definition 5 A set S is a linearly independent set if 
Equation: 


n 
So ain; =0<a,;=0,Vie 41.5.2; sae pe 
i=1 


Otherwise, the set S is linearly dependent. 


Definition 6 A finite set S of linearly independent vectors is a basis for the space X if 
[|S] = X, ie. if X is spanned by S. 


Definition 7 The dimension of X is the number of elements of its basis |.S|. A vector 
space for which a finite basis does not exist is called an infinite-dimensional space. 


Theorem 1 Any two bases of a subspace have the same number of elements. 


Proof: We prove by contradiction: assume that S$; = {x1,...,@,} and S2 = {y,..., ym} 
,m > n, are two bases for a subspace X with different numbers of elements. We have 
that since y; € X it can be written as a linear combination of Sj: 

Equation: 


n 
Y= S a;L;. 
i=l 


Order the elements of S; above so that a, is nonzero; since y; must be nonzero then at 
least one such a; must exist. Solving the above equation for x1 yields 
Equation: 


1 n 
1 = —_{ vy S AjLX; 
cl i=2 


Thus {y1, £2, £3, -.-) n} is a basis, in terms of which we can write any vector of the 
space X, including yo: 
Equation: 


n 
yo = biyi + S° bj xj. 
id 


Since y1, y2 are linearly independent, at least one of the values of b;, 2 > 2, must be 
nonzero. Sort the remaining x; so that by is nonzero. Solving for x2 results in 
Equation: 


1 by "5b; 
r2 = Be = ba a es a 


Therefore, {y1, Y2, £3, ---) £,} is a basis for X. Continuing in this way, we can eliminate 
each x;, showing that {y1, y2, -.-, yn} is a basis for X. Thus, we have yni1 = 0,4 CiYi 
, or equivalently: 

Equation: 


Cn4i1ynti + cyi =O with cyyi=-1. 


n 
w=1 


As aresult, Sg is linearly dependent and is not a basis. Therefore, all bases of X must 
have the same number of elements. 


Basis Representations 


Having a basis in hand for a given subspace allows us to express the points in the 
subspace in more than one way. For each point x € [S] in the span of a basis 
= {Si, Deere a that is, 

Equation: 


there is a one-to-one map (i.e., an equivalence) between x € [.S] and 
a = {aj,...,@n} € R”, that is, both x and a uniquely identify the point in S. This is 
stated more formally as a theorem. 


Theorem 2 If S is a linearly independent set, then 
Equation: 


3 aj;Sj = S b;S; 
i=1 i=1 


it-and only it-as = b; for? = 1, 2..n. 


Proof: Theorem 1 states that the scalars {a, ..., @,} are unique for x. We begin by 
assuming that indeed 


Equation: 
Sais = Shs 
i=l i=1 
This implies 
Equation: 
S° aj,Sj = Sob: 5; = 0, 
i=l i=1 
S > (ai — bi) Si =0 
w=1 


Since the elements of S are linearly independent, each one of the scalars of the sum must 
be zero, that is, a; — 6; = 0 and so a; = b; for eachz = 1,...,n. 


Example 5 (Digital Communications) A transmitter sends two waveforms: 
Equation: 


Sy (t) = 4/2/T cos (2xf-t) t€[0,T] if bit 1 is transmitted, 
Equation: 


So (t) = /2/T sin (2rf.t) t¢€[0,T] if bit 0 is transmitted. 


The signal r(t) recorded by the receiver is continuous, that is, r(t) € C[T]. Assuming 
that the propagation delay is known and corrected at the receiver, we will have they the 
received signal must be in the span of the possible transmitted signals, i-e., 

r(t) Espan (S$; (¢), So (¢)). One can check that S} (t) and S$» (t) are linearly 
independent. Thus, one can use a unique choice of coefficients ag and a; that denote 
whether bit 0 or bit 1 is transmitted and contain the amount of attenuation caused by the 
transmission: 

Equation: 


T (t) =aj,S} (t) + aoSo(t). 


The uniqueness of this representation can only be obtained if the transmitted signals 
So (t) and Sj (t) are linearly independent. The waveforms above are used in in phase 
shift keying (PSK); other similar examples include frequency shift keying (FSK) and 
quadrature amplitude modulation (QAM). 


Normed Spaces 
Description of norms, normed spaces, and Banach spaces 


Distances and metrics allow us to evaluate how different two signals are 


from each other. Norms allow us to evaluate how “big”, “important”, or 
“interesting” a given signal is. 


Definition 1 Assume X is a linear vector space. A norm on X is a function 
|| - ||: & — R with the following properties for all z, y € X anda € R: 


1. || x ||> 0, (non-negativity) 

2. || x ||= 0 if and only if x = 0 (zero norm for zero vector) 
3. || a- x ||= |al- || a ||, (scaling) 

4. || ¢+y ||<|| a || + || y ||. (wiangle inequality) 


|| x || is read as the norm of x or length of z. 


Intuitively, one can say that || x || is the distance between x and the zero 
vector (more on this soon). 


Definition 2 A vector space X with a norm || - || is called a normed linear 
vector space(X, || - ||) (or a normed space for brevity). 
Definition 3 Let (X, || - ||) be a normed space. The induced metric or 


induced distance is given by dz (x, y) =|| x — y |. 


Definition 4 If a normed space is complete under the induced metric, then it 
is called a Banach space. 


All norms induce distances, but not all distances are induced by norms. 


Example 1 Consider the distance 
Equation: 


0 if x=y, 


aoa) = . if cf y. 


Let us assume that there exists a norm || x ||; = d’ (x, 0) that would induce 
this distance. We would then have for x 4 0 anda ¢ {—1, 0, 1} that 

|| ax ||; = d’ (axz,0) =1and || z ||; =’ (x, 0) = 1, which contradicts 

|| ax ||; = |a| || x ||;. Thus || - ||, is not a valid norm. 


In contrast, here are some examples of valid norms. 


Example 2 The vector space X = C|T| accepts the norm 

| z ||, =supyer |x (t)|. The induced distance is 

d; (x,y) =supter |x (t) — y(t)| = dq (2, y); it is straightforward to 
prove properties (1—4). We previously showed that the metric space 
(C'|T], d..) is complete, and so (C'[T], || - ||.) is a Banach space. 


Example 3 The vector space _X = R” accepts the norm 


1/2 
Eales ( a |”) . The induced metric is 


1/2 

n 2 

di(z,y) = ||e@—yllp= pe jzi — yi ) = dz (x,y), the 
Euclidean distance. Thus || - ||, is known as the Euclidean norm. The 
spaces (IR”, dz) are Banach spaces for all values of n. 


Example 4 The vector space X = |0, 1) accepts the norm || x ||= |z|. The 
induced metric d; (x, y) = |x — y| = do (a, y) is the standard metric for 
the reals. Since we have previously shown that (X, dg) is not a complete 
vector space, then the space ((0, 1], || - ||) is not a Banach space. 


Inner Product Spaces and Hilbert Spaces 
Review of inner products and inner product spaces. 


Inner Products 


We have defined distances and norms to measure whether two signals are different from each 
other and to measure the “size” of a signal. However, it is possible for two pairs of signals with 
the same norms and distance to exhibit different behavior - an example of this contrast is to pick 
a pair of orthogonal signals and a pair of non-orthogonal signals, as shown in [link]. 


d(x1,x2) we 


Orthogonal Non - Orthogonal 


An example of orthogonality in a two-dimensional 
space: distances and norms are not indicative of 
orthogonality; two pairs of vectors with the same 
distance can have arbitrary angle between them. 


To obtain a new metric that distinguishes between orthogonal and non-orthogonal we use the 
inner product, which provides us with a new metric of “similarity”. 


Definition 1 An inner product for a vector space (X, R, +,-) is a function (-,-): X x X > R, 
sometimes denoted (-|-), with the following properties: for all x,y,z € X anda € R, 


1. (a, y) = (y, £) (complex conjugate property), 

2. (x + y, z) = (x, z) + (y, z) (distributive property), 
3. (ax, y) = a(x, y) (scaling property), 

4. (x, x) > Oand (x, x) = Oif and only ifz = 0. 


A vector space with an inner product is called an inner product space or a pre-Hilbert space. 
It is worth pointing out that properties (2-3) say that the inner product is linear, albeit only on the 


first input. However, if R = R, then the properties (2-3) hold for both inputs and the inner 
product is linear on both inputs. 


Just as every norm induces a distance, every inner product induces a norm: ||x||; = «/ (x, x). 


Hilbert Spaces 


Definition 2 An inner product space that is complete under the metric induced by the induced 
norm is called a Hilbert space. 


Example 1 The following are examples of inner product spaces: 


1. X = R” with the inner product (x, y) = )°°_, x;y; = y? x. The corresponding induced 
norm is given by ||a||, = «/(x, 2) = roe x? = ||z\|,, ie., the £2 norm. Since 


(R”, || - ||2) is complete, then it is a Hilbert ae 
2. X = C/T] with inner product (z, y) = [ae t)dt. The corresponding induced norm is 


elle 47 facel t)dt = ||a||,, ie., the LD» norm. 


ae i we ne o ei C|T] to be complex-valued, then the inner product is defined by 
=m re t)dt, and the corresponding induced norm is 


ae = iz tat = 4/ fp la (t) [Pat (at = [ll 
4, X = C” with inner ee (x,y) = S_, viyi = ya; here, x” denotes the Hermitian 
of x. The corresponding induced norm is ||z||; = gs |xi|” = |la||o. 
Theorem 1 (Cauchy-Schwarz Inequality) Assume X is an inner product space. For each 


x,y © X, we have that |(x,y)| < ||z||,||yl|,, with equality if (2) y = ax for some a € R; (it) 
ax =. 03 or (222) yy = 0: 


Proof: We consider two separate cases. 


° ify =O then (x,y) = (y,a) = (0-y,2) = O(y,x) = 0(z,y) =0= || a |I,|| y ||;. The 
proof is similar if x = 0. 
° Ifz,y #~ OthenO < (x — ay, x — ay) = (x, 2) —aly,z) —a(z,y) + aa (y, y), with 


equality if z — ay = 0,i.e., x = ay forsomea € R. Now set a = eu , and soa = wet . 
We then have 
Equation: 

(x, y) (y; ) (x,y) (y, #) 

0 < (a,x) — (y, x) — (x,y) + (y, 9) 
(YY) (y, ¥) (Y¥) (Ys ¥) 
(x,y) (£,y) l(a, y) |? 
S$ (2,2) - PSS = |le|?- 2 
all” 1 

This implies Kewl < Teale and so since all quantities involved are positive we have 


ly i 
(x, ¥)| < [lel] - lly. 


Properties of Inner Products Spaces 


In the previous lecture we discussed norms induced by inner products but failed to prove that 
they are valid norms. Most properties are easy to check; below, we check the triangle inequality 
for the induced norm. 


Lemma 1 If || 2 ||; = (2,2), then || ¢+y ||; < | 2], + lly Ih. 


From the definition of the induced norm, 
Equation: 
j@+yll; =(e+ye+y), 
= (x, a) + (@,y) + (y,@) + (ys ¥)s 
= |lell; + (2,9) + (2,9) + [yl 
= ||2||; + 2real ((x,y)) + [lyll;- 


At this point, we can upper bound the real part of the inner product by its magnitude: 
real ((x, y)) < |(x, y)|. Thus, we obtain 
Equation: 
lx tlle < llelly +2\(e,y)] + lal, 
S |lell? + 2llelslylls + lvl, 
S (lll; + llyill)’, 


where the second inequality is due to the Cauchy-Schwarz inequality. Thus we have shown that 
|x + yl; < |x|; + |ly||,- Here's an interesting (and easy to prove) fact about inner products: 


Lemma 2 If (x,y) = 0 for all x € X then y = 0. 


Proof: Pick x = y, and so (y, y) = 0. Due to the properties of an inner product, this implies that 
y= 0. 


Earlier, we considered whether all distances are induced by norms (and found a 
counterexample). We can ask the same question here: are all norms induced by inner products? 
The following theorem helps us check for this property. 


Theorem 2 (Parallelogram Law) If a norm ||-|| is induced by an inner product, then 
lo + yl? + lla — yl? = 2 (Tall? + llgll”) for all x,y € X. 


This theorem allows us to rule out norms that cannot be induced. 


Proof: For an induced norm we have ||z||” = (a, x). Therefore, 
Equation: 


jz+yll’+lz—yll? =@+y,c+y)+ @—y,2-y), 
as (@, 2) =e (x, y) + (Y, x) ae (y, y) ale (x, 2) p (x, y) _ (Y, x) ‘lg (YY); 


Example 2 Consider the normed space (C |T], Lo), and recall that ||z||,, =supzer |x (t)|. If 
this norm is induced, then the Parallelogram law would hold. If not, then we can find a 
counterexample. In particular, let T = (0, 27], x(t) = 1, and y(t) =cos (t). Then, we want to 
check if ||z + y||° + ||e — yl’ =2 (\lz? oF lull”). We compute: 


Equation: 


hoo 


lly [o-e) = 1, 
lz + yl|,¢ = ||1+ cos (£)|| =sup |1+ cos (¢)| =1+1=2, 
teT 


lz — y||,,. = |]1- cos (€)|| =sup |1— cos (t)| = 1 — (—1) = 2. 
teT 


Plugging into the two sides of the Parallelogram law, 
Equation: 


27-2? = 2(1? 4-17), 
8 =4, 


and the Parallelogram law does not hold. Thus, the Z., norm is not an induced norm. 


Orthogonality 
Definition of orthogonal vectors, sets, and subspaces; properties and benefits. 


Recall that a set S is a basis for a subspace X if [S] = X and S has linearly 
independent elements. If S is a basis for X then each x € X is uniquely 
determined by (a1, a2, ...,@n) such that }>\"_, a;s; = a. In this sense, we could 
operate either with z itself or with the vector a = |aj,...a,,] = R”. One would 
wonder then whether particular operations can be performed with a representation 
a instead of the original vector x. 


Example 1 Assume xz, y € X have representations a, b € R” in a basis for X. 
Can we say that (x, y) = (a,b)? 


For the particular example of X = Lz [0,1], S = (a, & t?} so that [S] = Q, the 
set of all quadratic functions supported on [0, 1]. Pick « = 2+¢ +t? and 
y=1+ 2¢+ 3¢7. One can see then that if we label s; = 1, s2 = t, s3 = t’, then 
the coefficient vectors for x and y are a = [2 1 1] andb = [1 2 3], respectively. 
Let us compute both inner products: 

Equation: 


_ 187 
— 2 


(x, y) = [ etoumae =f (2+¢+t7) (1+ 2¢ + 32?)de ~ 9.35, 


(a,b) =24+2+4+3=7. 
Since 7 # 9. 35, we find that we fail to obtain the desired equivalence between 
vectors and their representations. 


While this example was unsuccessful, simple conditions on the basis S will yield 
this desired equivalence, plus many more useful properties. 


Several definitions of orthogonality will be useful to us during the course. 


Definition 1 A pair of vectors x and y in an inner product space are orthogonal 
(denoted x | y) if the inner product (zx, y) = 0. 


Note that 0 is immediately orthogonal to all vectors. 


Definition 2 Let X be an inner product space. A set of vectors S CX is 
orthogonal if (x,y) = Oforallz,y€ S,x Ay. 


Definition 3 Let X be an inner product space. A set of vectors S CX is 
orthonormal if S is an orthogonal set and || s |= ./ (s,s) = 1 forall s € S. 


Definition 4 A vector x in an inner product space X is orthogonal to a setS C X 
(denoted x | S)ifx 1 y forall y € S. 


Definition 5 Let X be an inner product space. Two sets S; C X and Sz C X are 
orthogonal (denoted S$; | S2)if x | y forall x € S; andy € so. 


Definition 6 The orthogonal complement S* of a set S is the set of all vectors 
that are orthogonal to S. 


Benefits of Orthogonality 


Why is orthonormality good? For many reasons. One of them is the equivalence 
of inner products that we desired in a previous example. Another is that having an 
orthonormal basis allows us to easily find the coefficients a1, ...a,, of x in a basis 


S. 


Example 2 Let z € X and S be a basis for X (i.e., [S] = X). We wish to find 
Q1,...A,, such that x = 4 a; $;. Consider the inner products 
Equation: 


n n 
(z,8;) = S| 4585, 81 = S/ a;(s;,8:), 


due to the linearity of the inner product in the first term. If S' is orthonormal, then 
we have that fori ¢ 7 (s;, s;) = 0. In that case the sum above becomes 
Equation: 


2 
(@,8;) = a;(8;, 8;) = aj|| 8; ||" = a, 

due to the orthonormality of S. In other words, for an orthonormal basis S one can 

find the basis coefficients as a; = (x, s;). 

If S is not orthonormal, then we can rewrite the sum above as the product of a row 


vector and a column vector as follows: 
Equation: 


a1 


CONC CN ace or | (ie 


an 


We can then stack these equations for 2 = 1, ..., to obtain the following matrix- 
vector multiplication: 


Equation: 
(a, 81) (81,81) (82,81) +++ (8n;81)] [ay 
(x, $2) (81,82) (82,83) *+: (Sn,$2)) |as 
(£, 8n) (81, $n) (82, $n) ae (Sn; $n) an 


The nomenclature given above provides us with the matrix equation 6 = G - a, 
where £ and G have entries 8; = (x, s;) and G;,; = (s;, s;), respectively. 


Definition 7 The matrix G above is called the Gram matrix (or Gramian) of the 
set S. 


In the particular case of orthonormal S, it is easy to see that G = J, the identity 
matrix, and so a = (@ as given earlier. For invertible Gramians G,, one could 
compute the coefficients in vector form as a = G8. For square matrices (like G 
), invertibility is linked to singularity. 


Definition 8 A singular matrix is a non-invertible square matrix. A non-singular 
matrix is an invertible square matrix. 


Theorem 1 A matrix is singular if G - x = 0 forsome x ¥ 0. A matrix is non- 
singular if G - x = 0 only for x = 0. 


The link between this notion of singularity and invertibility is straightforward: if 
G is singular, then there is some a # 0 for which G - a = 0. Consider the 
mapping y = G - x; we would also have y = G(x + a). Since x # x + a, one 
cannot “invert” the mapping provided by G into y. 


Theorem 2 S is linearly independent if and only if G is non-singular (i.e. Gaz = 0 
if and only if « = 0). 


Proof: We will prove an equivalent statement: S is linearly dependent if and only 
if G is singular, i.e., if and only if there exists a vector x ~ 0 such that Ga = 0. 


(=) We first prove that if S is linearly dependent then G is singular. In this case 
there exist a set {a;} C R, with at least one nonzero, such that }>""_, ais; = 0. 
We then can write ()>”_, a;8;,8;) = (0,8,;) = 0 for each s;. Linearity allows us 
to take the sum and the scalar outside the inner product: 

Equation: 


n 
S° aj(S;, 8;) = 0. 
i=1 


We can rewrite this equation in terms of the entries of the Gram matrix as 
Soi, aGj; = 0. This sum, in turn, can be written as the vector inner product 
Equation: 
Qa) 
[Gj +++ Gin] |} : | =0, 


an 


which is true for every value of 7. We can therefore collect these equations into a 
matrix-vector product: 
Equation: 


Gu -*: Gin} [a 0 
Gni a Gann an 0 
Therefore we have found a nonzero vector a for which Ga = 0, and therefore G 


is singular. Since all statements here are equalities, we can backtrack to prove the 
opposite direction of the theorem (<=). 


Pythagorean Theorem 


There are still more nice proper ties for orthogonal sets of vectors. The next one 
has well-known geometric applications. 


Theorem 3 (Pythagorean theorem) If x and y are orthogonal (a | y), then 
2 2 2 
Fei +iy i =lle+y|/. 


Proof: 
Equation: 


|oty |? =(e+y,a+y) = (a,x) + (x,y) + (y,2) + (yy) 


Because x and y are orthogonal, (x, y) = (y, x) = 0 and we are left with 
2 2 
(e, 2) = || # ||" and (y,y) = || y ||. Thus: || 2 + y ||" =|] 2° +l yl’. 


Projection Theorem 
Introduces the concept of subspace projection and the optimality given by the projection theorem. 


Uniqueness of Decompositions 

For the next property, we need two quick definitions. 

Definition 1 The sum of two spaces S$, S2 C X is the set $1 + Sp = {a+y:2€ Si,y € So}. 
Definition 2 The subspaces S, 5, C X are disjoint if S$; S2 = {0}. 


Theorem 1 Let V, W C X be subspaces. For each x € V + W there exists a unique pairv € V,w © W 
such that x = v+ w if and only if V and W are disjoint. 


Proof: We prove the two directions of this “if and only if” statement separately. 


(=) If for each x € X there exists a unique pair v, w € V x W such that z = v + w, then the subspaces V 
and W are disjoint. 


Assume there exists a unique pair v, w s.t. 2 = u+ w for each x € V + W. For the sake of contradiction, 
assume V and W are not disjoint, i.e. there exists some z € VM W such that z # 0. Pick x and the 
corresponding v, w such that z = v + w. Then, 
Equation: 
L=v+wt+Zz-Z, 
=vU+zZ+w- Zz. 


We examine these two terms in the equation: 


Equation: 
v+z:vEV, zEVSev4ZzeEV, 
Equation: 
w-—z:wewWw, zewWs>w-zew. 
Also note 
Equation: 


v+zA#Av and w—zAw since z£0. 


So we have two pairs of elements from V and W which sum to 2, a contradiction to the assumption that these 
pairs are unique. Therefore, V ™ W = {0} and thus V and W are disjoint. 


(<=) If V and W are disjoint, then for each « € V + W there exists a unique pair v, w € V x W such that 
v+t+Ww=t. 


To prove uniqueness, we will assume that two distinct pairs exist and show at the end that they are qual to 
each other. The two pairs we begin with are 
Equation: 


z=v+u,veV andwe W, 


Equation: 
z=pt+qpeVandqew, 

Equation: 

p#vorg#w, 
so the pairs are distinct from each other. Subtract the equations: 
Equation: 

0 =p-—vt+q-uv, 

w-q =p-v. 

We examine these two terms in the equation: 
Equation: 


p-—v:pEV, veVaep-veV, 


Equation: 


w-q:weEWw,qewWw>aw-qew. 


Since those two terms are equal, 
Equation: 


w—qeV and p—vew. 


which means that w — g © VM W and p— v € VN W. Our starting assumption was that VM W = {0}, so 
therefore w — g = 0 and p — v = 0 which implies w = q and p = v. This statement is a contradiction since 
we assumed that the pairs were distinct, i.e., w 4 q and p # v. Therefore, only a unique pair v € V, w © W 
exists such that v + w = z. 


Fact 1 If S is a subspace, then S and S~ are disjoint. To see this, assume « € S$ S*, which means « € S+ 
and so (x, y) = 0 forall y € 9. Pick y = x so (x, x) = 0 > 2 = 0 which means SM S~ = {0}. 


Fact 2 Using Fact 1, we can show that for each x € X there exists a unique pair s € S andt € S* such that 
x = s+. In particular: 


eIfxeS,s=azandt=0. 

° Ife S+,s=Oandt=z. 

e Ife ¢ Sandz ¢ S+, boths,t £0. 
Projection Theorem 


In fact, there is quite a lot more we can say about the projection of a point x € X into a subspace S C X 


Theorem 2 Let X be a Hilbert space and S' be a closed subspace of X. For any vector  € X there exists a 
unique point so € S that is closest to a, ie., 
Equation: 


so =min|| x — s || 


In other words, 


Z—Sp || < || « — 5 || for all s € S with equality only if s = sp. 


Furthermore, so is a minimizer of || x — s || over s € S if and only if x — sp 1 S. In other words, 
2 — 39 € St. 


Proof: To structure our proof, we will restate the theorem into four separate facts: 


1. Existence: A minimizer of the distance || x — s || exists in S. 

2. Orthogonality of the error: If sq is a minimizer, then z — s) | S. 

3. Sufficiency of orthogonality: If « — sq  S, then sq is a minimizer to the distance function. 
4. Uniqueness: only one point that minimizes the distance exists in S. 


Each of these facts is proven separately. 


1. Existence: If « € S, then sg = a is the minimizer (since 0 is the minimum distance). If z ¢ S, we denote 
5 =inf,<g|| 2 — s ||. Note here that we use the infimum, which is the upper bound on all the lower bounds on 
the distance. The infimum is used because we do not yet know if there exists a point in S' that has lowest 
distance to x. Note also that an infimum exists because norms have a lower bound. 


We need to show that for some so € S, we have || 2—So ||= 6. Let {s;} C S be a sequence that yields 
|| 2—s; || 6. We will show that this sequence is a Cauchy sequence; thenm using the fact that the space S is 
closed and Hilbert, it is complete and therefore the Cauchy sequence must converge to a point sy € S. 


To prove that the sequence is Cauchy, we use the Parallelogram Law: 
Equation: 


I (8; 2) + (@ — 5) | +11 (83-2) — (@ — 5) [P_ = 2 (I sj-2 IP + 2-5 IP), 


|| 83-8: |? =2 |] sj;—2 |? +2 || e—s; ||? -|| 83 + 8: — 22 |)?, 
8; + 8; a 
I| $)— 8: |?) = 2 || s;—2 ||? +l] « — 8; ||? - 4||— 5 7 gl . 


Since S is a subspace, —- € §so |la — —— || > 6°, since 6 is the infimum. Therefore, 
Equation: 


| 8) — $4 PS 2 |] 2s; ||? +2] 8; — @ ||? — 40". 
At this point, we note that || s; — s; ||> 0 asi, 7 —> oo, and so the sequence can be shown to be a Cauchy 


sequence. Since {s;} is Cauchy, it converges to a point so and by the triangle inequality: 
Equation: 


|| c—so ||<|| e—s; || + || s; — $0 || for each value of j = 1,2,... 
Now the first term of the inequality goes to 6 as 7 — oo and the second term goes to 0, and so we have that 


|| c—so ||< 6 + € for all values of € > 0. Since 6 was the infimum of the norm on the left hand side, it 
follows that || «—so ||= 6, and so we have found that a minimizer exists. 


2. Orthogonality of the error: We proceed by contradiction by assuming x — SQ is not orthonormal to S, i.e., 
that there exists a unit-norm vector § € S such that (x — 89, 8) = 640. Let z= so + 68 € S. Then 


|| c—so ||<|| 2 — z ||. Furthermore, 
Equation: 


|a—z|? =(@-—za2-2z) = (a — 89 — 68, a — 89 — 68), 
= (£ — 89, — So) + (58, 58) — 2Re((x — 80, 68)), 
=|| e—so ||? +16)7|| 3 ||? — aRe(5 (a = s0,68)). 


Now we recognize two simplifications. First, we selected & so that || & |/?= 1; second, the remaining inner 
product is equal to —2Re (5 (x — 80, 68) = —2Re (55) = —2|6|?. Therefore, 
Equation: 


|| e@—z |? =|] e—s0 |? -|d) < || @— 80 ||’ since 5 £0. 


This is a contradiction, since we have found a z € S closer to x than so. Thus (x — so, 5) = 0 forall 5 € S, 
which means that x — so | S. 


3. Sufficiency of orthogonality: Assume x — so | S. For any § € S with 5 ¥ so, we have 

|| e—3 ||2= || w — s9 + 89 — § ||”. We use the Pythagorean theorem (if a L 6 then 

| a+b ||? =|] a ||? + || b |\?): since a — sy L Sand sy — 5 € S, we have 

|| e—F ||2=|| 2—so ||? +|| so — 5 ||’. Since so 4 3, we have || s9 — 5 ||> 0, and so || —S ||2> || 2 — s9 ||’. 
Now, since we picked § € S arbitrarily, it follows that sp is the minimizer of || z — s || overs € S. 


4, Uniqueness: We can write = « — so + so. From previous parts, we know that x — sy € Stand so € S. 
Since S and S$“ are disjoint, only one pair of values x — so and zp that holds these two properties exists. 
Therefore, the minimizer so is unique. 


Linear Operators 
Introduces linear operators, along with properties and classifications. 


We begin our treatment of linear operators, also known as transformations. 
They can be thought as extensions of functionals that map into arbitrary 
vector spaces rather than a scalar space. 


Definition 1 Let X and Y be be linear vector spaces and D C X. A rule 
A: X — Y which associates every element in x € D with an element 
y = A(x) € Y is said to be a transformation from X — Y with domain D. 


We have defined D because the transformation may only be defined for 
some subset of X. 


Definition 2 A transformation A :X — Y where X and Y are vector spaces 
over a scalar set R, is said to be linear if for every 71, 22 € X andall 
scalars @1,Q2 € R, 

Equation: 


A (a 121 + Q2%2) = a, A (x1) + Q2A (22) 


A common type of linear transformation is the transformation 

A: R” — R”. In this case, A is an m X n matrix with real-valued entries 
(i.e., A € R™*"). There are a variety of linear transformations that arise in 
practice, producing equations of the form A(x) = y, with x € X and 

y € Y, where X and ¥Y are linear vector spaces. For example, the equation 
Equation: 


may be written in operator notation as A(x(t)) = y(t), where 
A: C[T] — C[T] is the operator 
Equation: 


Often, we will simply write y = Az. 


Definition 3 Let (X, || - ||), (Y, || - ||y) be normed vector spaces. A 
linear operator A : X — Y is bounded if there exists a constant M < co 
such that || Aw — Y| < M || a || y for all 2 € X. The smallest M that 
satisfies this condition is the norm of A: 

Equation: 


[Alesy Sine 42 = tune Asp = taae. | Aaely. 
a reX |la|ly eX, lel] ,=1 xeX,||z|| <1 


Geometrically, the operator norm || A || measures the maximum extent A 
transforms the unit circle. Thus, || A || bounds the amplifying power of the 
operator A. 


Operators possess many properties that are shared with functionals, with 
similar proofs. 


Definition 4 A linear operator A : X — Y is continuous on_X if it is 
continuous at any point x € X. 


Theorem 1 A linear operator is bounded if and only if it is continuous. 
Definition 5 The sum of two linear operators A; : X — Y and 
Az: X — Y is defined as (A; + Ap)x = Ayx + Agz. Similarly, the 


scaling of a linear operator A : X — Y is defined as (cA)x = c(Az). Both 
resulting operators are linear as well. 


We can also extend the definition of the dual space to operators. 


Definition 6 The normed space of all bounded linear operators from 
X — Y is denoted B(X,Y) 


Are these spaces complete? 


Theorem 2 Let X, Y be normed linear spaces. If Y is a complete space 
then B(X, Y) is complete. 


Much of the terminology for operators is drawn from matrices. 


Definition 7 Let X be a linear vector space. The operator J : X — X given 
by I(x) = z for all x € X is known as the identity operator, and 
Te B(X,Y). 


Definition 8 Let A; :_ X — Y and Az: Y — Z be linear operators. The 
composition of these two operators (G2G1)x% = G2 (G12) is called a 
product operator. 


Definition 9 An operator A : X —> Y is injective (or one-to-one) if for each 
y € Y there exists at most one x € X such that y = Az. In other words, if 
Ax, — Az then x; = 2p. 


Definition 10 An operator A : X —> Y is surjective (or onto) if for all 
y € Y there exists an x € X such that y = Az. 


Definition 11 An operator A : X —> Y is bijective if it is injective and 
surjective. 


Lemma 1 If A; : X — Y isa bijective operator, then there exists a 
transformation Ay : Y + X.such that Ay (Aix) = x forall x € X. 


Note that the lemma above implies A.A, = I. Thus, we say that A, is 
invertible with inverse A ae 


Definition 12 An operator A : X — X is non-singular if it has an inverse 
in B(X, X); otherwise A is singular. 


In other words, if a transformation A : X —> X is non-singular there exists 
a transformation A~+ such that AA~! = J. This extends the concept of 
singularity from matrices to arbitrary operators. 


Fundamental Subspaces of an Operator 
Discusses the two fundamental subspaces of a linear operator - range and 
nullspace - and the rank of an operator. 


Definition 1 Let X be a finite dimensional Hilbert Space with orthonormal 
basis {b1, b2,...,6,} and A: X — Y bea linear operator. The range of A 
is the subspace 

Equation: 


& (A) = {y EY: y= Az € Xx} = [{ Abi, Abo, eb, tI. 


It is easy to see that Z(A) is a subspace of Y. To show the second equality 
above, note that if x € X then it can be written as 7 = Sa aj;b;; 
therefore, 

Equation: 


Az=A b> ot = S A (a;b;) = Sa: Ab; = s aii, 
= =1 =I =I 


where w; = Ab;. Before we can claim that {7;} is a basis for #(A), we 
must first show its elements are linearly independent. 


Lemma 1 If X be n-dimensional and A : X — Y, then &(A) has 
dimension less than for equal to n. 


If the set {w1, Wa, .--, Wn} given above is linearly independent then it is a 
basis for Z(A), and the dimension of #( A) is n. If they are not linearly 
independent we must show that dim(#(A)) < n. If {1, W2,..., Wn} are 
linearly dependent then there exists a set of scalars {a 1, @, ...,@,} such 
that }>""_, a;w; = 0 with at leas one nonzero a;; we let that be a without 
loss of generality. We then have 

Equation: 


y= 7 Sats, 
i=2 


Oj 


and so span ({w1, ..., Wn}) = span ({to, ..., n}). If the set {wo, ..., dn} 
is linearly independent, then dim(#(A)) = n — 1 < n. Otherwise, iterate 
this procedure to show that dim(#(A)) <n — 1. 


Definition 2 The null space of a A is the set of all points x € X that map to 
zero: 


Equation: 


N(A) = {x € X: Ax = 0}. 


It is easy to see that .V(A) is a subspace of X. 

Lemma 2 An operator A is non-singular if and only if (A) = {0}. 
We can extend the concept of rank from matrices to operators. 
Definition 3 The rank of A is the dimension of @(A) = rank(A). 


Theorem 1 Let A : X — Y and X be an n-dimensional space. Then, 
Equation: 


rank(A) + dim(./(A)) =n. 


Let (A) have dimension m. Design an orthonormal basis e1,..., En 
such that €1,..., @m are an orthonormal basis for .”(A) C X. This 
implies that ~; = Ae; = 0, for? = 1,...,m. Therefore, 


& (A) = span ({pi}i_1) = span cs eee SO 
rank(A) = dim(&#(A)) < n—™m. 


Now we need to show that {7;}"__, ,, are linearly independent. We use 
contradiction by assuming that {7);}"__, ,1 are linearly dependent, which 


means that pee 41 biti = 0 for some scalars 6; that are not all zero, i.e., 
Equation: 


so we know >"... bie; € W (A). We also know that e; | e; ,1 F J, 
and soe; | WY (A) fort = m+1,...,n. This implies in turn that 


. b;e; | W (A) for all choices of {b;}. Because 


i=m+1 
iem+1 bie; € W (A), the only possibility is that \77"_,4 bie: = 0. 
However, the orthonormal basis vectors e; fort = m+ 1,...,n are linear 


independent, and so we must have that 6; = 0. This is a contradiction with 
our original assumption, implying that the vectors {w;};"__, ,1 are linearly 
independent. Therefore, this set of vectors is a basis for R(A) and 
rank(A) = dim(R(A)) = n — m. This in turn implies that 

rank(A) + dim(.W(A)) =n. 


Properties of Linear Operators 
Discusses basic properties and classifications of operators. 


We start with a simple but useful property. 


Lemma 1 If A € B(X, X) and (z, Ax) = 0 for all e € X, then A= 0. 


Let a € C, then for any xz, y € X, we have that x + ay € X. Therefore, we obtain 
Equation: 
0 = (z+ ay, A(x + ay)), 
= (x + ay, Ax + aAy), 
= (x, Ax) + G(x, Ay) + a(y, Az) + |al"(y, Ay). 


Since x,y € X, we have that (x, Ar) = (y, Ay) = 0; therefore, 

0 =a(a, Ay) + a(y, Az). 

If we set a = 1, then 0 = (a, Ay) + (y, Az). So in this case, (x, Ay) = —(y, Az). 
If we set a = 7, then 0 = —i(a, Ay) + i(y, Ax). So in this case, (x, Ay) = (y, Az). 


Thus, (x, Ay) = 0 for all x, y € X, which means Ay = 0 for all y € X. So we can come 
to the conclusion that A = 0. 


Solutions to Operator Equations 


Assume X and Y are two normed linear spaces and A € B(.X,Y) is a bounded linear 
operator. Now pick y € Y. Then we pose the question: does a solution  € X to the 
equation Ax = y exist? 


There are three possibilities: 


1. A unique solution exists; 
2. multiple solutions exist; or 
3. no solution exists. 


We consider these cases separately below. 


1. Unique solution: Assume z and «x, are two solutions to the equation. In this case we 
have Az = Ax,. So A(x — x) = 0. Therefore, x — 2 € VW (A). If the solution x 
is unique then we must have = x, and x — x; = 0. Therefore, .V(A) = {0}. 
Since the operator has a trivial null space, then A~! exists. Thus, the solution to the 
equation is given by & = A ly; 


2. Multiple solutions: In this case, we may prefer to pick a particular solution. Often, 
our goal is to find the solution with smallest norm (for example, to reduce power in a 
communication problem). Additionally, there is a closed-form expression for the 
minimum-norm solution to the equation Ax = y. Theorem 1 Let X, Y be Hilbert 
spaces, y € Y, and A € B(X,Y). Then Z is the solution to Az = y if and only if 
¢@ = A°z, where z € Y is the solution of AA’z = y. Let 2; bea solution to Ax = y. 
Then all other solutions can be written as = xz — u, where u € .V(A). We 
therefore search for the solution that achieves the minimum value of ||z1, — u|| over 
u € W(A), ie., the closest point to x; in-4(A). Assume @ is such a point; then, 

a, —% L W (A). Now, recall that (4 (A))* =B@ (A’), sor=27—-Uuck (A*) 
. Thus @ = A’z for some z € Y, and y = Az = AA ‘z. Note that if AA’ is 
invertible, then @ = A°z = A*(AA’) “ty. 

3. No solution: In this case, we may aim to find a solution that minimizes the mismatch 
between the two sides of the equation, i.e., || y — Az ||. This is the well-known 
projection problem of y into #(A). Theorem 2 Let X, Y be Hilbert spaces, y € Y, 
and A € B(X,Y). The vector # minimizes ||z — Ay|| if and only if A AZ = A’y. 
Denote u = Az so that u € &(A). We need to find the minimum value of ||y — ul| 
over u € &(A). Assume 7 is the closest point to y in @(A), then y — tu 1 #(A), 
which means y — @ € (#(A))~. Recall that (#(A))” = VY (A*), which implies 
that y-— WE NV (A’). So A’ (y — at) = 0. Thus A*y = A‘@. Now, denoting 
u = AZ, we have that Ay — A’ A®. Note that if A A is invertible, then we have 


* 


@=(A*A) A’y 


Unitary Operator 
Definition 1 An operator A € B(X, X) is said to be unitary if AA’ = A’A = I. 
This implies that AY = A~!. Unitary operators have norm-preservation properties. 


Theorem 3A € B(X, X) is unitary if and only if (A) = X and ||Az|| = ||x|| for all 
ae, @ 


Let A be unitary, then for any x € X, 
Equation: 


ell = (@2) = y(@, Aa’2) = (de, Az) = ||As 


Since I: X + X and AA’: #(A) > #(A), then X = &(A). So if A € B(X, X) is 
unitary, then #(A) = X and ||Az|| = ||x|| for all « € X. From [link] we can find that 
Equation: 


0 =|la|? —||Az|? = (a,x) — (Ax, Ax) = (a, 2) — (2, A’ Az) = (2,2 7 A’ Az ). 


Since this is true for all 2 € X we have that 2 — A* Ax = 0 for all x € X, which means 
that x = A Az for all x € X. Therfore, we must have A’A = J. Additionally, since the 
operator is unitary, we find that for any x, y € X we have that 

|| Az — Ay|| = || A(a — y)|| = ||z — y||. So z = yif and only if Ax = Ay, implying 
that A is one-to-one. Since #(A) = X, then A is onto as well. Thus, A is invertible, 
which means A~!A = I. So A~!A = I = A’A, and therefore A~! = A’. The result is 
that A is unitary. We have shown that if (A) = X and ||Az|| = ||x|| for all z € X, 
then A € B(X, X) is unitary. 


Corollary 4 If X is finite-dimensional, then A is unitary if and only if || Az|| = ||z|| for 
alle UX: 


Introduction to vector spaces 
This module provides a brief review of some of the key concepts in vector 
spaces that will be required in developing the theory of compressive sensing. 


For much of its history, signal processing has focused on signals produced by 
physical systems. Many natural and man-made systems can be modeled as 
linear. Thus, it is natural to consider signal models that complement this kind of 
linear structure. This notion has been incorporated into modern signal 
processing by modeling signals as vectors living in an appropriate vector space. 
This captures the linear structure that we often desire, namely that if we add two 
signals together then we obtain a new, physically meaningful signal. Moreover, 
vector spaces allow us to apply intuitions and tools from geometry in R?, such 
as lengths, distances, and angles, to describe and compare signals of interest. 
This is useful even when our signals live in high-dimensional or infinite- 
dimensional spaces. 


Throughout this course, we will treat signals as real-valued functions having 
domains that are either continuous or discrete, and either infinite or finite. These 
assumptions will be made clear as necessary in each chapter. In this course, we 
will assume that the reader is relatively comfortable with the key concepts in 
vector spaces. We now provide only a brief review of some of the key concepts 
in vector spaces that will be required in developing the theory of compressive 
sensing (CS). For a more thorough review of vector spaces see this introductory 
course in Digital Signal Processing. 


We will typically be concerned with normed vector spaces, i.e., vector spaces 
endowed with a norm. In the case of a discrete, finite domain, we can view our 
signals as vectors in an N-dimensional Euclidean space, denoted by R%. When 
dealing with vectors in R%, we will make frequent use of the £, norms, which 
are defined for p € [1, co] as 

Equation: 


3 |B 


(Mi lel’), € [1,00); 


ee Eee p = oo. 


zl, = 


In Euclidean space we can also consider the standard inner product in RN, 
which we denote 


Equation: 


N 
(x,z) = 27a = S° nae ae 
i=l 


This inner product leads to the £2 norm: ||a||, = «/ (a, 2). 


In some contexts it is useful to extend the notion of £, norms to the case where 
p < 1. In this case, the “norm” defined in [link] fails to satisfy the triangle 
inequality, so it is actually a quasinorm. We will also make frequent use of the 
notation ||z||) := |supp (x)|, where supp (x) = {i : x; A 0} denotes the 
support of x and |supp(x)| denotes the cardinality of supp(x). Note that ||-||, 
is not even a quasinorm, but one can easily show that 

Equation: 


2) Lim lla? = |supp (2)], 


justifying this choice of notation. The £, (quasi-)norms have notably different 
properties for different values of p. To illustrate this, in [link] we show the unit 


sphere, i.e., {e : ||z||,=1 h induced by each of these norms in R?. Note that 


for p < 1 the corresponding unit sphere is nonconvex (reflecting the 
quasinorm's violation of the triangle inequality). 


Unit sphere Unit sphere Unit sphere Unit sphere for 
for 2; norm for £9 norm for 2,. norm £, quasinorm 
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Unit spheres in IR? for the £, norms with p = 1, 2, oo, and for the @, 


a. 


quasinorm with p = =. 


We typically use norms as a measure of the strength of a signal, or the size of an 
error. For example, suppose we are given a signal x € R? and wish to 
approximate it using a point in a one-dimensional affine space A. If we measure 
the approximation error using an £, norm, then our task is to find the  € A that 
minimizes ||x — Z]|,,. The choice of p will have a significant effect on the 
properties of the resulting approximation error. An example is illustrated in 
[link]. To compute the closest point in A to x using each @, norm, we can 
imagine growing an £, sphere centered on z until it intersects with A. This will 
be the point & € A that is closest to x in the corresponding £, norm. We observe 
that larger p tends to spread out the error more evenly among the two 
coefficients, while smaller p leads to an error that is more unevenly distributed 
and tends to be sparse. This intuition generalizes to higher dimensions, and 
plays an important role in the development of CS theory. 


Approximation Approximation Approximation Approximation 
in £; norm in £9 norm in £,., norm ie 
quasinorm 


Best approximation of a point in R? by a a one-dimensional subspace using 


the £, norms for p = 1, 2, 00, and the £, quasinorm with p = S. 


Bases and frames 
This module provides an overview of bases and frames in finite- 
dimensional Hilbert spaces. 


A set ¥ = {y;},— 9 is called a basis for a finite-dimensional vector space 
Y if the vectors in the set span ¥ and are linearly independent. This implies 
that each vector in the space can be represented as a linear combination of 
this (smaller, except in the trivial case) set of basis vectors in a unique 
fashion. Furthermore, the coefficients of this linear combination can be 
found by the inner product of the signal and a dual set of vectors. In discrete 


settings, we will only consider real finite-dimensional Hilbert spaces where 
P= RP and. F S41 eg IVY 


Mathematically, any signal z € R% may be expressed as, 


Equation: 
c= > arbi, 
1c I 


where our coefficients are computed as a; = (x, W;) and {ai} are the 
icl 

vectors that constitute our dual basis. Another way to denote our basis and 

its dual is by how they operate on z. Here, we call our dual basis Y our 

synthesis basis (used to reconstruct our signal by [link]) and W is our 

analysis basis. 


An orthonormal basis (ONB) is defined as a set of vectors ¥ = {wi}, 9 
that form a basis and whose elements are orthogonal and unit norm. In other 
words, (%;, ~;) = Oif i A 7 and one otherwise. In the case of an ONB, the 


synthesis basis is simply the Hermitian adjoint of analysis basis (w ="), 


It is often useful to generalize the concept of a basis to allow for sets of 
possibly linearly dependent vectors, resulting in what is known as a frame. 
More formally, a frame is a set of vectors {¥;}"_, inR%,d<n 
corresponding to a matrix ¥ € R2*”, such that for all vectors « € R42, 
Equation: 


2 2 2 
Alla], < |¥* al], < Blla|); 


with 0 < A < B < oo. Note that the condition A > 0 implies that the 
rows of W must be linearly independent. When A is chosen as the largest 
possible value and B as the smallest for these inequalities to hold, then we 
call them the (optimal) frame bounds. If A and B can be chosen as A = B, 
then the frame is called A-tight, and if A = B = 1, then W is a Parseval 
frame. A frame is called equal-norm, if there exists some A > 0 such that 
||Willp = A for all 7 = 1, ..., N, and it is unit-norm if A = 1. Note also that 
while the concept of a frame is very general and can be defined in infinite- 
dimensional spaces, in the case where W is ad x N matrix A and B simply 
correspond to the smallest and largest eigenvalues of YWW7, respectively. 


Frames can provide richer representations of data due to their redundancy: 
for a given signal z, there exist infinitely many coefficient vectors a such 
that z = Wa. In particular, each choice of a dual frame W provides a 
different choice of a coefficient vector a. More formally, any frame 
satisfying 

Equation: 


py? — py? —] 


is called an (alternate) dual frame. The particular choice v= (ww) Wis 
referred to as the canonical dual frame. It is also known as the Moore- 
Penrose pseudoinverse. Note that since A > 0 requires ¥ to have linearly 


independent rows, we ensure that WwW is invertible, so that W is well- 
defined. Thus, one way to obtain a set of feasible coefficients is via 
Equation: 


ag = 07 (DT) 2. 


One can show that this sequence is the smallest coefficient sequence in £2 
norm, ie., ||a@qg||, < ||a||, for all a such that c = Wa. 


Finally, note that in the sparse approximation literature, it is also common 
for a basis or frame to be referred to as a dictionary or overcomplete 
dictionary respectively, with the dictionary elements being called atoms. 


Sparse representations 
This module provides an overview of sparsity and sparse representations, 
giving examples for both 1-D and 2-D signals. 


Transforming a signal to a new basis or frame may allow us to represent a 
signal more concisely. The resulting compression is useful for reducing data 
storage and data transmission, which can be quite expensive in some 
applications. Hence, one might wish to simply transmit the analysis 
coefficients obtained in our basis or frame expansion instead of its high- 
dimensional correlate. In cases where the number of non-zero coefficients is 
small, we say that we have a sparse representation. Sparse signal models 
allow us to achieve high rates of compression and in the case of 
compressive sensing, we may use the knowledge that our signal is sparse in 
a known basis or frame to recover our original signal from a small number 
of measurements. For sparse data, only the non-zero coefficients need to be 
stored or transmitted in many cases; the rest can be assumed to be zero). 


Mathematically, we say that a signal x is K-sparse when it has at most K 
nonzeros, i.e., ||x||) < A. We let 
Equation: 


Ux = {x: ||a\|) < K} 


denote the set of all K’-sparse signals. Typically, we will be dealing with 
signals that are not themselves sparse, but which admit a sparse 
representation in some basis W. In this case we will still refer to x as being 
-sparse, with the understanding that we can express x as x = Wa where 
| @ llb < K. 


Sparsity has long been exploited in signal processing and approximation 
theory for tasks such as compression [link], [link], [link] and 

denoising [link], and in statistics and learning theory as a method for 
avoiding overfitting [link]. Sparsity also figures prominently in the theory 
of statistical estimation and model selection [link], [link], in the study of the 
human visual system [link], and has been exploited heavily in image 
processing tasks, since the multiscale wavelet transform [link] provides 


nearly sparse representations for natural images. Below, we briefly describe 
some one-dimensional (1-D) and two-dimensional (2-D) examples. 


1-D signal models 


We will now present an example of three basis expansions that yield 
different levels of sparsity for the same signal. A simple periodic signal is 
sampled and represented as a periodic train of weighted impulses (see 
[link]). One can interpret sampling as a basis expansion where our elements 
in our basis are impulses placed at periodic points along the time axis. We 
know that in this case, our dual basis consists of sinc functions used to 
reconstruct our signal from discrete-time samples. This representation 
contains many non-zero coefficients, and due to the signal's periodicity, 
there are many redundant measurements. Representing the signal in the 
Fourier basis, on the other hand, requires only two non-zero basis vectors, 
scaled appropriately at the positive and negative frequencies (see [link]). 
Driving the number of coefficients needed even lower, we may apply the 
discrete cosine transform (DCT) to our signal, thereby requiring only a 
single non-zero coefficient in our expansion (see [link]). The DCT equation 
is Ap = we. Ln, COS (F(n + + )k) with k = 0,---,.N—1, x, the 
input signal, and N the length of the transform. 


Time-Domain Signal CT Basis 
FourerBasis = ttttsti‘x 


Cosine signal in three representations: (a) Train of impulses (b) 
Fourier basis (c) DCT basis 


2-D signal models 


This same concept can be extended to 2-D signals as well. For instance, a 
binary picture of a nighttime sky is sparse in the standard pixel domain 
because most of the pixels are zero-valued black pixels. Likewise, natural 
images are characterized by large smooth or textured regions and relatively 
few sharp edges. Signals with this structure are known to be very nearly 
Sparse when represented using a multiscale wavelet transform [link]. The 
wavelet transform consists of recursively dividing the image into its low- 
and high-frequency components. The lowest frequency components provide 
a coarse scale approximation of the image, while the higher frequency 
components fill in the detail and resolve edges. What we see when we 
compute a wavelet transform of a typical natural image, as shown in [link], 
is that most coefficients are very small. Hence, we can obtain a good 
approximation of the signal by setting the small coefficients to zero, or 
thresholding the coefficients, to obtain a K’-sparse representation. When 
measuring the approximation error using an £, norm, this procedure yields 
the best K-term approximation of the original signal, i.e., the best 
approximation of the signal using only K basis elements.[footnote | 
Thresholding yields the best K-term approximation of a signal with respect 
to an orthonormal basis. When redundant frames are used, we must rely on 
sparse approximation algorithms like those described later in this 

course [link], [link]. 


Sparse representation of an image via a multiscale wavelet transform. 
(a) Original image. (b) Wavelet representation. Large coefficients are 
represented by light pixels, while small coefficients are represented by 
dark pixels. Observe that most of the wavelet coefficients are close to 


Zero. 


Sparsity results through this decomposition because in most natural images 
most pixel values vary little from their neighbors. Areas with little contrast 
difference can be represent with low frequency wavelets. Low frequency 
wavelets are created through stretching a mother wavelet and thus 
expanding it in space. On the other hand, discontinuities, or edges in the 
picture, require high frequency wavelets, which are created through 
compacting a mother wavelet. At the same time, the transitions are 
generally limited in space, mimicking the properties of the high frequency 
compacted wavelet. See "Compressible signals” for an example. 


Compressible signals 
This module describes compressible signals, i.e., signals that can be well- 
approximated by sparse signals. 


Compressibility and K-term approximation 


An important assumption used in the context of compressive sensing (CS) 
is that signals exhibit a degree of structure. So far the only structure we 
have considered is sparsity, i.e., the number of non-zero values the signal 
has when representation in an orthonormal basis YW. The signal is considered 
sparse if it has only a few nonzero values in comparison with its overall 
length. 


Few structured signals are truly sparse; rather they are compressible. A 
signal is compressible if its sorted coefficient magnitudes in W decay 
rapidly. To consider this mathematically, let x be a signal which is 
compressible in the basis W: 

Equation: 


x = Wa, 


where q are the coefficients of x in the basis W. If z is compressible, then 
the magnitudes of the sorted coefficients a, observe a power law decay: 
Equation: 


a(S Os yo 125 c34 
We define a signal as being compressible if it obeys this power law decay. 


The larger q is, the faster the magnitudes decay, and the more compressible 
a signal is. [link] shows images that are compressible in different bases. 


The image in the upper left is a signal that is compressible in space. 
When the pixel values are sorted from largest to smallest, there is a 
sharp descent. The image in the lower left is not compressible in 
space, but it is compressible in wavelets since its wavelet coefficients 
exhibit a power law decay. 


Because the magnitudes of their coefficients decay so rapidly, compressible 
signals can be represented well by K < WN coefficients. The best A-term 
approximation of a signal is the one in which the K largest coefficients are 
kept, with the rest being zero. The error between the true signal and its K 
term approximation is denoted the K-term approximation error ox (x), 
defined as 

Equation: 


ox (x) =arg min || x — Wor |lp. 


For compressible signals, we can establish a bound with power law decay 
as follows: 
Equation: 


OK (x) aS Coe 


In fact, one can show that ox (x), will decay as K” if and only if the 
sorted coefficients a; decay as i~’+!/? [link]. [link] shows an image and its 
K-term approximation. 


Sparse approximation of a natural image. (a) Original image. 

(b) Approximation of image obtained by keeping only the largest 10% 
of the wavelet coefficients. Because natural images are compressible 
in a wavelet domain, approximating this image it in terms of its largest 
wavelet coefficients maintains good fidelity. 


Compressibility and £, spaces 


A signal's compressibility is related to the £, space to which the signal 
belongs. An infinite sequence z(7) is an element of an £, space for a 
particular value of p if and only if its 2, norm is finite: 

Equation: 


1 


||, (>: =?) ee 


t 


The smaller p is, the faster the sequence's values must decay in order to 
converge so that the norm is bounded. In the limiting case of p = 0, the 
“norm” is actually a pseudo-norm and counts the number of non-zero 
values. As p decreases, the size of its corresponding £, space also decreases. 
[link] shows various £, unit balls (all sequences whose £, norm is 1) in 3 
dimensions. 


Boe++ 


QO<p<l p=O0 


As the value of p decreases, the size of the corresponding £, space also 
decreases. This can be seen visually when comparing the the size of 
the spaces of signals, in three dimensions, for which the £, norm is less 
than or equal to one. The volume of these £, “balls” decreases with p. 


Suppose that a signal is sampled infinitely finely, and call it z[n]. In order 
for this sequence to have a bounded @, norm, its coefficients must have a 
power-law rate of decay with g > 1/p. Therefore a signal which is in an £, 
space with p < 1 obeys a power law decay, and is therefore compressible. 


Sensing matrix design 
This module provides an overview of the sensing matrix design problem in 
compressive sensing. 


In order to make the discussion more concrete, we will restrict our attention 
to the standard finite-dimensional compressive sensing (CS) model. 
Specifically, given a signal x € R%, we consider measurement systems that 
acquire MM linear measurements. We can represent this process 
mathematically as 

Equation: 


y= oz, 


where @ is an M x N matrix and y € R™. The matrix & represents a 
dimensionality reduction, i.e., it maps R, where N is generally large, into 
R™, where M is typically much smaller than N. Note that in the standard 
CS framework we assume that the measurements are non-adaptive, 
meaning that the rows of @ are fixed in advance and do not depend on the 
previously acquired measurements. In certain settings adaptive 
measurement schemes can lead to significant performance gains. 


Note that although the standard CS framework assumes that z is a finite- 
length vector with a discrete-valued index (such as time or space), in 
practice we will often be interested in designing measurement systems for 
acquiring continuously-indexed signals such as continuous-time signals or 
images. For now we will simply think of z as a finite-length window of 
Nyquist-rate samples, and we temporarily ignore the issue of how to 
directly acquire compressive measurements without first sampling at the 
Nyquist rate. 


There are two main theoretical questions in CS. First, how should we 
design the sensing matrix @ to ensure that it preserves the information in 
the signal x? Second, how can we recover the original signal x from 
measurements y? In the case where our data is sparse or compressible, we 
will see that we can design matrices with MZ < N that ensure that we 
will be able to recover the original signal accurately and efficiently using a 
variety of practical algorithms. 


We begin in this part of the course by first addressing the question of how 
to design the sensing matrix @. Rather than directly proposing a design 
procedure, we instead consider a number of desirable properties that we 
might wish @ to have (including the null space property, the restricted 


important examples of matrix constructions that satisfy these properties. 


Null space conditions 

This module introduces the spark and the null space property, two common 
conditions related to the null space of a measurement matrix that ensure the 
success of sparse recovery algorithms. Furthermore, the null space property is 
shown to be a necessary condition for instance optimal or uniform recovery 
guarantees. 


A natural place to begin in establishing conditions on @ in the context of 
designing a sensing matrix is by considering the null space of @, denoted 
Equation: 


N(@) = {z: &z = 0}. 


If we wish to be able to recover all sparse signals x from the measurements 
@zx, then it is immediately clear that for any pair of distinct vectors 

zz € Le = {z: ||z||) < A}, we must have $x # 2’, since otherwise it 
would be impossible to distinguish x from 2’ based solely on the 
measurements y. More formally, by observing that if 6x = dz’ then 

P(x — x') = Owith x — 2’ € Sox, we see that uniquely represents all 

x € S’x if and only if (®) contains no vectors in 372%. There are many 
equivalent ways of characterizing this property; one of the most common is 
known as the spark [link]. 


The spark 


The spark of a given matrix @ is the smallest number of columns of & 
that are linearly dependent. 


This definition allows us to pose the following straightforward guarantee. 
(Corollary 1 of [link]) 


For any vector y € R™, there exists at most one signal x € ¥/x such that 
y = &z if and only if spark(@) > 2K. 


We first assume that, for any y € R™, there exists at most one signal x € Xx 
such that y = x. Now suppose for the sake of a contradiction that 
spark(®@) < 2K. This means that there exists some set of at most 2K 


columns that are linearly dependent, which in turn implies that there exists an 
h € W(®@) such that h € Xx. In this case, since h € Xo we can write 

h = «a — a’, where x, x’ € S'x. Thus, since h € W(®) we have that 

&(x — x’) = Oand hence $x = &z’. But this contradicts our assumption that 
there exists at most one signal x € 2x such that y = #zx. Therefore, we must 
have that spark(@) > 2K. 


Now suppose that spark(®) > 2K. Assume that for some y there exist 

x, a’ € S'x such that y = x = Hx’. We therefore have that (x — zx’) = 0. 
Letting h = x — 2’, we can write this as 6h = 0. Since spark(@) > 2K, all 
sets of up to 2K columns of @ are linearly independent, and therefore h = 0. 
This in turn implies x = 2’, proving the theorem. 


It is easy to see that spark(®) € [2, M + 1]. Therefore, [link] yields the 
requirement M > 2K. 


The null space property 


When dealing with exactly sparse vectors, the spark provides a complete 
characterization of when sparse recovery is possible. However, when dealing 
restrictive conditions on the null space of # [link]. Roughly speaking, we 
must also ensure that .”(®) does not contain any vectors that are too 
compressible in addition to vectors that are sparse. In order to state the formal 
definition we define the following notation that will prove to be useful 
throughout much of this course. Suppose that A C {1,2,---,N} is a subset 
of indices and let A° = {1,2,---,N}\ A. By zy we typically mean the 
length N vector obtained by setting the entries of x indexed by A‘ to zero. 
Similarly, by ®, we typically mean the M x N matrix obtained by setting 
the columns of @ indexed by A‘ to zero.[footnote] 

We note that this notation will occasionally be abused to refer to the length 
|A| vector obtained by keeping only the entries corresponding to A or the 

M x |A| matrix obtained by only keeping the columns corresponding to A. 
The usage should be clear from the context, but typically there is no 
substantive difference between the two. 


A matrix @ satisfies the null space property (NSP) of order K if there 
exists a constant C’ > 0 such that, 
Equation: 


Rally SC 


IPaclly 
K 


holds for all h € -W(@) and for all A such that |A| < K. 


The NSP quantifies the notion that vectors in the null space of @ should not 
be too concentrated on a small subset of indices. For example, if a vector h is 
exactly K-sparse, then there exists a A such that ||A4¢||, = 0 and hence [link] 
implies that hy = 0 as well. Thus, if a matrix @ satisfies the NSP then the 
only K-sparse vector in .W(@) is h = 0. 


To fully illustrate the implications of the NSP in the context of sparse 
recovery, we now briefly discuss how we will measure the performance of 
sparse recovery algorithms when dealing with general non-sparse z. Towards 
this end, let A : R™ — R% represent our specific recovery method. We will 
focus primarily on guarantees of the form 

Equation: 


OK(x), 


VK 


| A(z) — a\|,<C 


for all x, where we recall that 
Equation: 


ox(2),=min |x ~ 2l, 
K 


This guarantees exact recovery of all possible K-sparse signals, but also 
ensures a degree of robustness to non-sparse signals that directly depends on 
how well the signals are approximated by K-sparse vectors. Such guarantees 
are called instance-optimal since they guarantee optimal performance for each 
instance of x [link]. This distinguishes them from guarantees that only hold 


for some subset of possible signals, such as sparse or compressible signals — 

the quality of the guarantee adapts to the particular choice of x. These are also 
commonly referred to as uniform guarantees since they hold uniformly for all 
2. 


Our choice of norms in [link] is somewhat arbitrary. We could easily measure 
the reconstruction error using other £, norms. The choice of p, however, will 
limit what kinds of guarantees are possible, and will also potentially lead to 
alternative formulations of the NSP. See, for instance, [link]. Moreover, the 
form of the right-hand-side of [link] might seem somewhat unusual in that we 
measure the approximation error as 0x(2),//K rather than simply 
something like ox(x),. However, we will see later in this course that such a 
guarantee is actually not possible without taking a prohibitively large number 
of measurements, and that [link] represents the best possible guarantee we can 
hope to obtain (see "Instance-optimal guarantees revisited"). 


Later in this course, we will show that the NSP of order 2 is sufficient to 
establish a guarantee of the form [link] for a practical recovery algorithm (see 
"Noise-free signal recovery"). Moreover, the following adaptation of a 
theorem in [link] demonstrates that if there exists any recovery algorithm 
satisfying [link], then must necessarily satisfy the NSP of order 2K. 
(Theorem 3.2 of [link]) 


Let & : R™ — R” denote a sensing matrix and A : R™ — R denote an 
arbitrary recovery algorithm. If the pair (®, A) satisfies [link] then @ satisfies 
the NSP of order 2K. 


Suppose h € .V(®) and let A be the indices corresponding to the 2K largest 
entries of h. We next split A into Ag and Aj, where | Ag |=|A,|= K. Set 

xz =hy, +h, and x’ = —hy,, so that h = x — z’. Since by construction 

x’ € Sx, we can apply [link] to obtain x’ = A(z’). Moreover, since 

h € W(®@), we have 

Equation: 


fh = &(x — 2') =0 


so that dx’ = Sx. Thus, x’ = A(z). Finally, we have that 


Equation: 


I eacll 
V2K ” 


OK(z) 
Palle < Allg = lle — 2'||, = |lx — A(Pz) ||. < C y, t= 4/90 


where the last inequality follows from [link]. 


The restricted isometry property 

In this module we introduce the restricted isometry property (RIP) and discuss its 
role in compressive sensing. In particular, we describe the relationship between the 
RIP and the concept of stability in the context of sparse signal acquisition. We also 
provide a simple lower bound on the number of measurements necessary for a 
matrix to satisfy the RIP. 


The null space property (NSP) is both necessary and sufficient for establishing 


guarantees of the form 
Equation: 


ox(x), 


A(x) - al], << C4 aa 


but these guarantees do not account for noise. When the measurements are 
contaminated with noise or have been corrupted by some error such as 
quantization, it will be useful to consider somewhat stronger conditions. In [link], 
Candés and Tao introduced the following isometry condition on matrices @ and 
established its important role in compressive sensing (CS). 


A matrix © satisfies the restricted isometry property (RIP) of order K if there 
exists ad € (0, 1) such that 
Equation: 


(1 —dx)|lellz < ||Pell) < (1 + 6x)|lell, 


holds for all € X'~ = {x : ||z||) < K}. 


If a matrix @ satisfies the RIP of order 2K, then we can interpret [link] as saying 
that approximately preserves the distance between any pair of K-sparse vectors. 
This will clearly have fundamental implications concerning robustness to noise. 


It is important to note that in our definition of the RIP we assume bounds that are 
symmetric about 1, but this is merely for notational convenience. In practice, one 
could instead consider arbitrary bounds 

Equation: 


2 2 2 
allal|, < Px]. < All| 


where 0 < a < 6 < oo. Given any such bounds, one can always scale & so that it 
satisfies the symmetric bound about 1 in [link]. Specifically, multiplying @ by 
\/2/(8 + q) will result in an & that satisfies [link] with constant 

dx = (G—a)/ (8+ a). We will not explicitly show this, but one can check that 
all of the theorems in this course based on the assumption that @ satisfies the RIP 
actually hold as long as there exists some scaling of & that satisfies the RIP. Thus, 
since we can always scale @ to satisfy [link], we lose nothing by restricting our 
attention to this simpler bound. 


Note also that if ® satisfies the RIP of order K with constant 6x, then for any 

K' < K we automatically have that @ satisfies the RIP of order K' with constant 
On: < 6x. Moreover, in [link] it is shown that if ® satisfies the RIP of order K 
with a sufficiently small constant, then it will also automatically satisfy the RIP of 
order yK for certain , albeit with a somewhat worse constant. 

(Corollary 3.4 of [link]) 


Suppose that & satisfies the RIP of order K with constant 6x. Let y be a positive 
integer. Then @ satisfies the RIP of order K’ = +y | = | with constant dx: < y- Ox, 
where |-| denotes the floor operator. 


This lemma is trivial for y = 1, 2, but for y > 3 (and K > 4) this allows us to 
extend from RIP of order K to higher orders. Note however, that 6% must be 
sufficiently small in order for the resulting bound to be useful. 


The RIP and stability 


We will see later in this course that if a matrix @ satisfies the RIP, then this is 
sufficient for a variety of algorithms to be able to successfully recover a sparse 
signal from noisy measurements. First, however, we will take a closer look at 
whether the RIP is actually necessary. It should be clear that the lower bound in the 
RIP is a necessary condition if we wish to be able to recover all sparse signals x 
from the measurements @z for the same reasons that the NSP is necessary. We can 
say even more about the necessity of the RIP by considering the following notion 
of stability. 


Let & : R“ — R™ denote a sensing matrix and A : R” — RY denote a 
recovery algorithm. We say that the pair (®, A) is C-stable if for any x € XK 
and any e € R™ we have that 

Equation: 


|| A(z + e) — al], < Cllel). 


This definition simply says that if we add a small amount of noise to the 
measurements, then the impact of this on the recovered signal should not be 
arbitrarily large. [link] below demonstrates that the existence of any decoding 
algorithm (potentially impractical) that can stably recover from noisy 
measurements requires that & satisfy the lower bound of [link] with a constant 
determined by C. 


If the pair (@, A) is C-stable, then 
Equation: 


1 
Glltlle = fell, 


forall a € Mor. 


Pick any z, z € 2x. Define 


Equation: 
&(z— 2) P(x — z) 
Cx = a and ez = 9”? 
and note that 
Equation: 
P(x + z) 


@x +e, = @$z+e,= 5 


Let = A($x+e,) = A(Sz+ e,). From the triangle inequality and the 
definition of C-stability, we have that 
Equation: 


|e — 2p =|le-@+e—a2ll 


< |v — 2]. + [2 — zl, 
< Clles|| + Cllezlls 
= C||z — $2z||,. 


Since this holds for any x, z € 2’x, the result follows. 


Note that as C’ + 1, we have that & must satisfy the lower bound of [link] with 
dx =1-1/C 2 _+ 0. Thus, if we desire to reduce the impact of noise in our 
recovered signal then we must adjust @ so that it satisfies the lower bound of [link] 
with a tighter constant. 


One might respond to this result by arguing that since the upper bound is not 
necessary, we can avoid redesigning @ simply by rescaling @ so that as long as & 
satisfies the RIP with d2x < 1, the rescaled version a@ will satisfy [link] for any 
constant C’. In settings where the size of the noise is independent of our choice of & 
, this is a valid point — by scaling @ we are simply adjusting the gain on the 
“signal” part of our measurements, and if increasing this gain does not impact the 
noise, then we can achieve arbitrarily high signal-to-noise ratios, so that eventually 
the noise is negligible compared to the signal. 


However, in practice we will typically not be able to rescale & to be arbitrarily 
large. Moreover, in many practical settings the noise is not independent of ®. For 
example, suppose that the noise vector e represents quantization noise produced by 
a finite dynamic range quantizer with B bits. Suppose the measurements lie in the 
interval |—T, T], and we have adjusted the quantizer to capture this range. If we 
rescale & by a, then the measurements now lie between [—aT, aT], and we must 
scale the dynamic range of our quantizer by a. In this case the resulting 
quantization error is simply ae, and we have achieved no reduction in the 
reconstruction error. 


Measurement bounds 


We can also consider how many measurements are necessary to achieve the RIP. If 
we ignore the impact of 6 and focus only on the dimensions of the problem (N, M, 
and £’) then we can establish a simple lower bound. We first provide a preliminary 
lemma that we will need in the proof of the main theorem. 


Let K and NN satisfying kK < N/2 be given. There exists a set X C x such that 
for any x € X we have ||a||, < VK and for any 2, z € X with « F z, 
Equation: 


lx — all, > of K/2, 


and 
Equation: 


K N 
> — — |. 
log |X| > 5 toe (+) 


We will begin by considering the set 
Equation: 


v= {2 € {0,+1,—-1}" : lal], = K}. 
By construction, ||a||5 = K for all x € U. Thus if we construct X by picking 
elements from U then we automatically have ||2||, << /K. 


Next, observe that |U| = Care Note also that || — z||) < ||z — z\|, and thus 


if |x — z||5 < K/2 then ||x — z||) < A/2. From this we observe that for any 
fixed 7 =U, 
Equation: 


{2 EU: |x —2||2 < K/2}| <|{zeU: |e —zl|, < K/2}| < (cia) 


Thus, suppose we construct the set X by iteratively choosing points that satisfy 
[link]. After adding 7 points to the set, there are at least 


Equation: 
K Kk /2 


points left to pick from. Thus, we can construct a set of size |X| provided that 


Equation: 
N K/2 N\ 3K 
x a 2 
IC se)?" < (Fe) 


Next, observe that 
Equation: 


(x) aamv—K)y BP WK +i (z i) 


e) —  K\(N—K)! ey (K/2+% 


where the inequality follows from the fact that (n — K + 2)/(K/2 + 7“) is 
decreasing as a function of 7. Thus, if we set |X| = (V/K) K/2 then we have 
Equation: 


x 3 ae 3N ae N oN oe Nol K/2 
4 AK  \K AK “\k 2 


Hence, [link] holds for |X] = (N/K a * which establishes the lemma. 


IA 


Using this lemma, we can establish the following bound on the required number of 
measurements to satisfy the RIP. 


Let be an M x N matrix that satisfies the RIP of order 2K with constant 
O6€ (0, 5 |. Then 
Equation: 


M > CK log (z) 


where C’ = 1/2 log (v24 + 1) ~ 0. 28. 


We first note that since @ satisfies the RIP, then for the set of points X in [link] we 
have, 
Equation: 


| Sx — B2||, > V1 —d|le — ally > /K/4 


for all x, z € X, since x — z € Nox andd < +. Similarly, we also have 
Equation: 


Pall, < V1+d|lzlly < /3K/2 


forallz € X. 


From the lower bound we can say that for any pair of points 2, z € X, if we center 


balls of radius \/K/4/2 = \/K/16 at Sz and &z, then these balls will be 
disjoint. In turn, the upper bound tells us that the entire set of balls is itself 
contained within a larger ball of radius \/3K /2 + \/K/16. If we let 

BM (r) = {a € R™ : ||x||, <r}, this implies that 

Equation: 


vol( a" ( y/3x/24 vers)) > |X ia (lh alae 
(/3K/2+ y'K/6) > |x (x6) , 
(v24 + 1)" > |X], 


log |X| 


" log (v24+1) 


The theorem follows by applying the bound for |X| from [link]. 


Note that the restriction to 6 < + is arbitrary and is made merely for convenience 
— minor modifications to the argument establish bounds for 6 < dmax for any 
Omax < 1. Moreover, although we have made no effort to optimize the constants, it 
is worth noting that they are already quite reasonable. 


Although the proof is somewhat less direct, one can establish a similar result (in the 
dependence on N and K) by examining the Gelfand width of the @, ball [link]. 
However, both this result and [link] fail to capture the precise dependence of M/ on 
the desired RIP constant 6. In order to quantify this dependence, we can exploit 
recent results concerning the Johnson-Lindenstrauss lemma, which concerns 
embeddings of finite sets of points in low-dimensional spaces [link]. Specifically, it 


is shown in [link] that if we are given a point cloud with p points and wish to 
embed these points in R™ such that the squared £ distance between any pair of 
points is preserved up to a factor of 1 + e, then we must have that 

Equation: 


co log (p) 


M 2 2 y] 


€ 


where Cg > O is a constant. 


The Johnson-Lindenstrauss lemma is closely related to the RIP. We will see in 
"Matrices that satisfy the RIP" that any procedure that can be used for generating a 
linear, distance-preserving embedding for a point cloud can also be used to 
construct a matrix that satisfies the RIP. Moreover, in [link] it is shown that if a 
matrix @ satisfies the RIP of order K = c, log (p) with constant 6, then @ can be 
used to construct a distance-preserving embedding for p points with e = 6/4. 
Combining these we obtain 

Equation: 


colog(p)  16cy.K 


M> 
= €2 C102 


Thus, for small 6 the number of measurements required to ensure that @ satisfies 
the RIP of order K will be proportional to K /5?, which may be significantly 
higher than K log (V/Kc). See [link] for further details. 


The RIP and the NSP 

This module describes the relationship between the restricted isometry 
property (RIP) and the null space property (NSP). Specifically, it is shown 
that a matrix which satisfies the RIP will also satisfy the NSP. 


(RIP), then it also satisfies the null space property (NSP). Thus, the RIP is 
strictly stronger than the NSP. 


Suppose that © satisfies the RIP of order 2K with doxK < /2 —1. Then & 
satisfies the NSP of order 2/ with constant 
Equation: 


V 259K 


an 1— (1+ v2)dox 


The proof of this theorem involves two useful lemmas. The first of these 
follows directly from standard norm inequality by relating a K’-sparse 
vector to a vector in R“. We include a simple proof for the sake of 
completeness. 


Suppose u € 2x. Then 


Equation: 
lull a 
Tee lle S VE lela: 
For any u, ||u||, = |(u, sgn(u))|. By applying the Cauchy-Schwarz 


inequality we obtain ||u||, < ||w/|,||sgn(w)||,. The lower bound follows 
since sgn(w) has exactly K nonzero entries all equal to +1 (since u € 3’x) 
and thus ||sgn(w)|| = K. The upper bound is obtained by observing that 
each of the K nonzero entries of u can be upper bounded by ||u!| ... 


Below we state the second key lemma that we will need in order to prove 
[link]. This result is a general result which holds for arbitrary h, not just 
vectors h € W(@). It should be clear that when we do have h € .V(®), 
the argument could be simplified considerably. However, this lemma will 
prove immensely useful when we turn to the problem of sparse recovery 
from noisy measurements later in this course, and thus we establish it now 
in its full generality. We state the lemma here, which is proven in "4; 
minimization proof". 


Suppose that @ satisfies the RIP of order 2K, and let h € RX h + () be 
arbitrary. Let Ag be any subset of {1, 2,..., NW} such that |Ao|< K. Define 
A, as the index set corresponding to the K entries of h ac with largest 


magnitude, and set A = Ap U Ay. Then 


Equation: 
Lyra fh, Ph 
eaee eee 
VK RAlls 
where 
Equation: 
ones V 260K _ 1 
1 box’ — 1 -boK | 


Again, note that [link] holds for arbitrary h. In order to prove [link], we 
merely need to apply [link] to the case where h € V(®). 


Towards this end, suppose that h € .V(@). It is sufficient to show that 
Equation: 


one 


Ralls < © 


holds for the case where A is the index set corresponding to the 2K largest 
entries of h. Thus, we can take Apo to be the index set corresponding to the 
K largest entries of h and apply [link]. 


The second term in [link] vanishes since £h = 0, and thus we have 
Equation: 


I Pac|| 
hall, < a—S. 
VK 
Using [link], 
Equation: 
acl, = Wally + WRaelly < VK lPay lly + (Rael 


resulting in 
Equation: 


Weaelh 
Valle <a Ital + 2), 


Since |[Ra,||2 < ||Ra 
Equation: 


9, we have that 


hac 
(1—a)|[hallg < oa, 


The assumption 69% < /2 — 1 ensures that a < 1, and thus we may 
divide by 1 — a without changing the direction of the inequality to establish 
[link] with constant 

Equation: 


C= a - J/ 260K 


l-a 1— (1+ V2) box 


as desired. 


Matrices that satisfy the RIP 
This module provides some examples of matrices that satisfy the restricted 
isometry property (RIP), focusing primarily on random constructions. 


We now turn to the question of how to construct matrices that satisfy the 
construct matrices of size MZ x N that satisfy the RIP of order A, but such 
constructions also require // to be relatively large [link], [link]. For 
example, the construction in [link] requires M = O(K 2 log N ) while the 
construction in [link] requires M = O( KN“) for some constant a. In 
many real-world settings, these results would lead to an unacceptably large 
requirement on M. 


Fortunately, these limitations can be overcome by randomizing the matrix 
construction. We will construct our random matrices as follows: given M 
and NV, generate random matrices @ by choosing the entries y;; as 
independent realizations from some probability distribution. We begin by 
observing that if all we require is that 62x > 0 then we may set M = 2K 
and draw a @ according to a Gaussian distribution. With probability 1, any 
subset of 2K columns will be linearly independent, and hence all subsets of 
2K columns will be bounded below by 1 — 62x where 62x > 0. However, 
suppose we wish to know the constant 62x. In order to find the value of the 


constant we must consider all possible i) &-dimensional subspaces of 


IR. From a computational perspective, this is impossible for any realistic 
values of N and K. Thus, we focus on the problem of achieving the RIP of 
order 2K for a specified constant 62%. Our treatment is based on the simple 
approach first described in [link] and later generalized to a larger class of 
random matrices in [link]. 


To ensure that the matrix will satisfy the RIP, we will impose two 
conditions on the random distribution. First, we require that the distribution 
will yield a matrix that is norm-preserving, which will require that 
Equation: 


1 
E (¢;;) — M’ 


and hence the variance of the distribution is 1/M. Second, we require that 
the distribution is a sub-Gaussian distribution, meaning that there exists a 
constant c > 0 such that 

Equation: 


E(e*") cA ect /2 


for all t € IR. This says that the moment-generating function of our 
distribution is dominated by that of a Gaussian distribution, which is also 
equivalent to requiring that tails of our distribution decay at least as fast as 
the tails of a Gaussian distribution. Examples of sub-Gaussian distributions 
include the Gaussian distribution, the Bernoulli distribution taking values 
zal | VM, and more generally any distribution with bounded support. See 
"Sub-Gaussian random variables" for more details. 


For the moment, we will actually assume a bit more than sub-Gaussianity. 
Specifically, we will assume that the entries of @ are strictly sub-Gaussian, 


which means that they satisfy [link] with c? = E (¢?,) = a: Similar 


results to the following would hold for general sub-Gaussian distributions, 
but to simplify the constants we restrict our present attention to the strictly 
sub-Gaussian ®. In this case we have the following useful result, which is 
proven in "Concentration of measure for sub-Gaussian random variables". 


Suppose that @ is an M x N matrix whose entries ;; are i.i.d. with y;; 
drawn according to a strictly sub-Gaussian distribution with c? = 1 /M. Let 
Y = @z for x € RX. Then for any € > 0, and any xz € RY, 

Equation: 


E(I|¥ 13) =e Mh 


and 
Equation: 


Me? 
2 2 2 
P(|I¥ I-Ie [8] 2 ele 8) <2 ex (-2E) 


with x = 2/(1— log (2)) = 6.52. 


This tells us that the norm of a sub-Gaussian random vector strongly 
concentrates about its mean. Using this result, in "Proof of the RIP for sub- 
Gaussian matrices" we provide a simple proof based on that in [link] that 
sub-Gaussian matrices satisfy the RIP. 


Fix 6 € (0,1). Let & be an M x N random matrix whose entries y;; are 
i.i.d. with y;; drawn according to a strictly sub-Gaussian distribution with 
c? = 1/M.If 

Equation: 


N 
M > «1K log (Zz). 


then @ satisfies the RIP of order K with the prescribed 6 with probability 
exceeding 1 — 2e-*2 | where kK is arbitrary and 
Ko = 67/26 — log (42e/5)/K1. 


Note that in light of the measurement bounds in "The restricted isometry 
property" we see that [link] achieves the optimal number of measurements 
(up to a constant). 


Using random matrices to construct has a number of additional benefits. 
To illustrate these, we will focus on the RIP. First, one can show that for 
random constructions the measurements are democratic, meaning that it is 
possible to recover a signal using any sufficiently large subset of the 
measurements [link], [link]. Thus, by using random @ one can be robust to 
the loss or corruption of a small fraction of the measurements. Second, and 
perhaps more significantly, in practice we are often more interested in the 
setting where z is sparse with respect to some basis W. In this case what we 
actually require is that the product ®W satisfies the RIP. If we were to use a 


deterministic construction then we would need to explicitly take W into 
account in our construction of ®, but when @ is chosen randomly we can 
avoid this consideration. For example, if & is chosen according to a 
Gaussian distribution and W is an orthonormal basis then one can easily 
show that @W will also have a Gaussian distribution, and so provided that 

M is sufficiently high W will satisfy the RIP with high probability, just as 
before. Although less obvious, similar results hold for sub-Gaussian 
distributions as well [link]. This property, sometimes referred to as 
universality, constitutes a significant advantage of using random matrices to 
construct @. 


Finally, we note that since the fully random matrix approach is sometimes 
impractical to build in hardware, several hardware architectures have been 
implemented and/or proposed that enable random measurements to be 
acquired in practical settings. Examples include the random 

demodulator [link], random filtering [link], the modulated wideband 
converter [link], random convolution [link], [link], and the compressive 
multiplexer [link]. These architectures typically use a reduced amount of 
randomness and are modeled via matrices @ that have significantly more 
structure than a fully random matrix. Perhaps somewhat surprisingly, while 
it is typically not quite as easy as in the fully random case, one can prove 
that many of these constructions also satisfy the RIP. 


Proof of the RIP for sub-Gaussian matrices 
In this module we provide a proof that sub-Gaussian matrices satisfy the restricted 
isometry property. 


We now show how to exploit the concentration of measure properties of sub- 
Gaussian distributions to provide a simple proof that sub-Gaussian matrices satisfy 
constructing an M x N matrix @ at random with / sufficiently large, then with 
high probability there exists a dx € (0,1) such that 

Equation: 


(1 —dx)llallz < |Pelly < (1+ dx)llell3 


holds for all x € Xx (where »'x denotes the set of all signals x with at most K 
nonzeros). 


We begin by observing that if all we require is that 62% > 0, then we may set 

M = 2K and draw a @ according to a Gaussian distribution, or indeed any 
continuous univariate distribution. In this case, with probability 1, any subset of 2K 
columns will be linearly independent, and hence all subsets of 2/ columns will be 
bounded below by 1 — 69x where 62% > 0. However, suppose we wish to know the 
constant 62x. In order to find the value of the constant we must consider all possible 


(x) K-dimensional subspaces of R%. From a computational perspective, this is 


impossible for any realistic values of N and K. Moreover, in light of the lower 
bounds we described earlier in this course, the actual value of 62x in this case is 
likely to be very close to 1. Thus, we focus instead on the problem of achieving the 
RIP of order 2K for a specified constant dx. 


To ensure that the matrix will satisfy the RIP, we will impose two conditions on the 
random distribution. First, we require that the distribution is sub-Gaussian. In order 
to simplify our argument, we will use the simpler results stated in Corollary 2 from 
"Concentration of measure for sub-Gaussian random variables", which we briefly 
recall. 


Suppose that @ is an M x N matrix whose entries ¢;; are i.i.d. with 

yi; ~ SSub (1/M). Let Y = &z for x € R%. Then for any € > 0, and any 
x € RN : 

Equation: 


E(I¥ 13) =ll ell 


and 
Equation: 


P({Il ¥ 3 - ll @ [3] > el 13) <2exp (- Me) 


with « = 2/(1— log (2)) © 6.52. 


By exploiting this theorem, we assume that the distribution used to construct @ is 
strictly sub-Gaussian. This is done simply to yield more concrete constants. The 
argument could easily be modified to establish a similar result for general sub- 
Gaussian distributions by instead using Theorem 2 from "Concentration of measure 
for sub-Gaussian random variables". 


Our second condition is that we require that the distribution yield a matrix that is 
approximately norm-preserving, which will require that 
Equation: 


| 
E (y3,) = M”’ 


and hence the variance is 1/M. 


We shall now show how the concentration of measure inequality in [link] can be 
used together with covering arguments to prove the RIP for sub-Gaussian random 
matrices. Our general approach will be to construct nets of points in each K- 
dimensional subspace, apply [link] to all of these points through a union bound, and 
then extend the result from our finite set of points to all possible A -dimensional 
signals. Thus, in order to prove the result, we will require the following upper bound 
on the number of points required to construct the nets of points. (For an overview of 
results similar to [link] and of various related concentration of measure results, we 
refer the reader to the excellent introduction of [link].) 


Let € € (0, 1) be given. There exists a set of points Q such that || q ||, = 1 for all 
qd € Q,|Q| < (3/e)*, and for any « € R* with || a ||, = 1 there is a point g € Q 
satisfying || « —q ||, <€. 


We construct Q in a greedy fashion. We first select an arbitrary point gq, € R“ with 
|| @1 ||2= 1. We then continue adding points to Q so that at step i we add a point 

qi € R* with || q; ||2= 1 which satisfies || g; — q; ||2> € for all 7 < 7. This 
continues until we can add no more points (and hence for any x € R* with 

|| z ||, = 1 there is a point g € Q satisfying || x — q ||, < €.) Now we wish to 
bound |Q]. Observe that if we center balls of radius €/2 at each point in Q, then 
these balls are disjoint and lie within a ball of radius 1 + €/2. Thus, if BX (r) 
denotes a ball of radius r in R“, then 


Equation: 
|Q| - Vol(B* (€/2)) < Vol(B* (1 + €/2)) 

and hence 
Equation: 

_ Vol(BK (1+ «/2)) 

Ql < Vol(B* (€/2)) 
_ (L+e/2)* 
(</2)* 
< (8/e)". 


We now turn to our main theorem, which is based on the proof given in [link]. 


Fix 6 € (0,1). Let be an M x N random matrix whose entries y;, are i.i.d. with 
Pij™ SSub (1/M). If 
Equation: 


N 
M > «1K log (xz). 


then & satisfies the RIP of order K with the prescribed 6 with probability exceeding 
1 — 2e-*™, where Ky > 1 is arbitrary and Ky = 6?/2K — 1/K1— log (42e/0). 


First note that it is enough to prove [link] in the case || x ||, = 1, since @ is linear. 
Next, fix an index set T C {1,2,..., N} with |T| = K, and let X77 denote the K- 
dimensional subspace spanned by the columns of @. We choose a finite set of 


points Qr such that Q7 C Xz, || q ||, = 1 forall g € Qr, and for all e € X7 with 
|| z ||, = 1 we have 
Equation: 


min || 7 — < 6/14. 
min || ea lly <4/ 


From [link], we can choose such a set Q7 with JQrl< (42/ Or: We then repeat this 


process for each possible index set 7’, and collect all the sets Q7 together: 
Equation: 


Q= U ar. 


T:|T|=K 


There are es possible index sets 7’. We can bound this number by 


Equation: 
(2). eee ey, 


where the last inequality follows since from Sterling's approximation we have 

K! > (K/e)*. Hence |Q| < (42eN/5K)*. Since the entries of & are drawn 
according to a strictly sub-Gaussian distribution, from [link] we have [link]. We next 
use the union bound to apply [link] to this set of points with « = 6/ /2, with the 
result that, with probability exceeding 

Equation: 


1 — 2(42eN/5K)* e MP)" 


we have 
Equation: 


(1-6/v2)I ale < 2a < (1+4/v2)|lq 13, for all geQ. 


We observe that if M satisfies [link] then 


Equation: 


A2eN \* N A2e M A2e 
l <Kil l <—+M1 
06 (“S| = (toe (52) +18 (3°) a os () 


and thus [link] exceeds 1 — 2e~*™ as desired. 


We now define A as the smallest number such that 
Equation: 


|| Gz ||, << V1+A, for all ce Xx, || x ||, =1. 


Our goal is to show that A < 0. For this, we recall that for any x € ¥’x with 
|| z ||, = 1, we can pick a g € Q such that || « — q ||, < 6/14 and such that 
x—q€ Lx (since if x € X7, we can pick gq € Qr C X_f satisfying 

||  — q ||, < 6/14). In this case we have 

Equation: 


| Be || < || Fall + F(@—@) [ly < V1 4. 6/V24+ VIA - 4/14. 


Since by definition A is the smallest number for which [link] holds, we obtain 


V1+A< [1+ 6/v2+ VITA - 4/14. Therefore 


Equation: 


hea hades Ae 


“ieee a. 


as desired. We have proved the upper inequality in [link]. The lower inequality 
follows from this since 
Equation: 


| Bx Ip > || Pally — | @(@— 4) lly > 1 -8/V2- V1 +5 -8/14> V1 5 


which completes the proof. 


Above we prove above that the RIP holds with high probability when the matrix 
is drawn according to a strictly sub-Gaussian distribution. However, we are often 
interested in signals that are sparse or compressible in some orthonormal basis 

W + I, in which case we would like the matrix W to satisfy the RIP. In this setting 
it is easy to see that by choosing our net of points in the AK-dimensional subspaces 
spanned by sets of AK columns of Y, [link] will establish the RIP for W for & again 
drawn from a sub-Gaussian distribution. This universality of with respect to the 
sparsity-inducing basis is an attractive property that was initially observed for the 
Gaussian distribution (based on symmetry arguments), but we can now see is a 
property of more general sub-Gaussian distributions. Indeed, it follows that with 
high probability such a & will simultaneously satisfy the RIP with respect to an 
exponential number of fixed bases. 


Coherence 

In this module we introduce coherence, which provides a more 
computationally friendly alternative to conditions such as the spark, NSP, or 
RIP. We briefly describe the theoretical relationship between these 
conditions. 


While the spark, null space property (NSP), and restricted isometry 
property (RIP) all provide guarantees for the recovery of sparse signals, 
verifying that a general matrix @ satisfies any of these properties has a 
combinatorial computational complexity, since in each case one must 
essentially consider i) submatrices. In many settings it is preferable to 


use properties of @ that are easily computable to provide more concrete 
recovery guarantees. The coherence of a matrix is one such property [link], 
[link]. 


The coherence of a matrix ®, ~(®), is the largest absolute inner 
product between any two columns 9;, y; of &: 
Equation: 


(Gis 5) 


w(®)= max 7. 
1<i<j<N || y; |l2 || 9; ll. 


It is rele to show that the coherence of a matrix is always in the range 


 N-M _ . 
ye | M_D mr the lower bound is known as the Welch 


_ [link], [link], [link]. Note that when NV >> M, the lower bound is 
approximately pp (®) > 1/V M. 


One can sometimes relate coherence to the spark, NSP, and RIP. For 
example, the coherence and spark properties of a matrix can be related by 
employing the Gershgorin circle theorem [link], [link]. 

(Theorem 2 of [link]) 


The eigenvalues of an N x N matrix M with entries m;;, 1 < 1,7 < N, 
lie in the union of N discs d; = d; (c;, rj), 1 < i < N, centered at 


Cc; = m,; and with radius r; = ii |miyl. 


Applying this theorem on the Gram matrix G = G1 h A leads to the 
following straightforward result. 


For any matrix &, 
Equation: 


Since spark(®) does not depend on the scaling of the columns, we can 
assume without loss of generality that @ has unit-norm columns. Let 

A C {1,...,.N} with |A| = p determine a set of indices. We consider the 
restricted Gram matrix G = Gh A, which satisfies the following 
properties: 


© |gil< u(D,1<i,5<piF). 


then the matrix G is positive definite, so 


From [link], if )) 54; | 943 |<|gii 
that the columns of @, are linearly independent. Thus, the spark condition 
implies (p — 1)y(&) < 1 or, equivalently, p < 1+ 1/p() for all 

p < spark(&), yielding spark(@) > 1+ 1/p(®). 


By merging Theorem 1 from "Null space conditions" with [link], we can 
pose the following condition on @ that guarantees uniqueness. 
(Theorem 12 of [link]}) 


If 
Equation: 


then for each measurement vector y € R™ there exists at most one signal 
x € J'x such that y = 2. 


[link], together with the Welch bound, provides an upper bound on the level 
of sparsity K that guarantees uniqueness using coherence: K = O (v M ) . 


Another straightforward application of the Gershgorin circle theorem 
({link]) connects the RIP to the coherence property. 


If ® has unit-norm columns and coherence ps = p(®), then & satisfies the 
RIP of order K with 6 = (K — 1)y forall K < 1/p. 


The proof of this lemma is similar to that of [link]. 


The results given here emphasize the need for small coherence j4(&) for the 
matrices used in CS. Coherence bounds have been studied both for 
deterministic and randomized matrices. For example, there are known 
matrices © of size M x M? that achieve the coherence lower bound 
pl(P\=1/ VM, such as the Gabor frame generated from the Alltop 
sequence [link] and more general equiangular tight frames [link]. These 
constructions restrict the number of measurements needed to recover a K- 
sparse signal to be M = O(K 2 log N ). Furthermore, it can be shown that 
when the distribution used has zero mean and finite variance, then in the 
asymptotic regime (as M and N grow) the coherence converges to 

p(&) = / (2 log N)/M [link], [link], [link]. Such constructions would 
allow K to grow asymptotically as M = O(K 2 log N 5 matching the 
known finite-dimensional bounds. 


The measurement bounds dependent on coherence are handicapped by the 
squared dependence on the sparsity A, but it is possible to overcome this 
bottleneck by shifting the types of guarantees from worst-case/deterministic 
behavior, to average-case/probabilistic behavior [link], [link]. In this 
setting, we pose a probabilistic prior on the set of K-sparse signals x € XK 
. It is then possible to show that if & has low coherence j4(@) and spectral 
norm || $ ||,, and if K = O(y~? () log N), then the signal x can be 
recovered from its CS measurements y = @z with high probability. Note 
that if we replace the Welch bound, then we obtain K = O(M log N), 


which returns to the linear dependence of the measurement bound on the 
signal sparsity that appears in RIP-based results. 


Signal recovery via €_1-norm minimization 
This module introduces and motivates 2_1 minimization as a framework for 
Sparse recovery. 


As we will see later in this course, there now exist a wide variety of 
approaches to recover a sparse signal x from a small number of linear 
measurements. We begin by considering a natural first approach to the 
problem of sparse recovery. 


Given measurements y = @z and the knowledge that our original signal x 
is sparse or compressible, it is natural to attempt to recover x by solving an 
optimization problem of the form 

Equation: 


@=argmin ||z||, subjectto z¢ Ay), 
z 


where A(y) ensures that & is consistent with the measurements y. Recall 
that ||z||) = |supp (z)| simply counts the number of nonzero entries in z, so 
[link] simply seeks out the sparsest signal consistent with the observed 
measurements. For example, if our measurements are exact and noise-free, 
then we can set B(y) = {z : &z = y}. When the measurements have been 
contaminated with a small amount of bounded noise, we could instead set 
By) = {z: ||Gz—yl|, < e}. In both cases, [link] finds the sparsest x 
that is consistent with the measurements y. 


Note that in [link] we are inherently assuming that z itself is sparse. In the 
more common setting where z = Wa, we can easily modify the approach 
and instead consider 

Equation: 


@=argmin ||z||, subjectto z<¢ Ay) 
z 


where B(y) = {z: Wz = y} or B(y) = {z: ||GWz— yl, < e}. By 


setting & = SW we see that [link] and [link] are essentially identical. 
Moreover, as noted in "Matrices that satisfy the RIP", in many cases the 


introduction of W does not significantly complicate the construction of 


matrices & such that & will satisfy the desired properties. Thus, for most of 
the remainder of this course we will restrict our attention to the case where 
W = I. It is important to note, however, that this restriction does impose 
certain limits in our analysis when W is a general dictionary and not an 
orthonormal basis. For example, in this case 

|Z — x||, = ||We — Well, A ||@ — a|,, and thus a bound on ||é — cll, 
cannot directly be translated into a bound on ||% — ||, which is often the 
metric of interest. 


Although it is possible to analyze the performance of [link] under the 
appropriate assumptions on @, we do not pursue this strategy since the 
objective function ||-||,) is nonconvex, and hence [link] is potentially very 
difficult to solve. In fact, one can show that for a general matrix @, even 
finding a solution that approximates the true minimum is NP-hard. One 
avenue for translating this problem into something more tractable is to 
replace ||-||, with its convex approximation ||-||;. Specifically, we consider 
Equation: 


£=argmin ||z||, subjectto ze Aly). 
z 


Provided that @(y) is convex, [link] is computationally feasible. In fact, 
when A(y) = {z: &z = y}, the resulting problem can be posed as a linear 
program [link]. 


Approximation in 2; norm Approximation in £, quasinorm 


Best approximation of a point in R? by a a one-dimensional 


subspace using the £; norm and the é, quasinorm with p = 


It is clear that replacing [link] with [link] transforms a computationally 
intractable problem into a tractable one, but it may not be immediately 
obvious that the solution to [link] will be at all similar to the solution to 
[link]. However, there are certainly intuitive reasons to expect that the use 
of £; minimization will indeed promote sparsity. As an example, recall the 
example we discussed earlier shown in [link]. In this case the solutions to 
the 4; minimization problem coincided exactly with the solution to the £, 
minimization problem for any p < 1, and notably, is sparse. Moreover, the 
use of £; minimization to promote or exploit sparsity has a long history, 
dating back at least to the work of Beurling on Fourier transform 
extrapolation from partial observations [link]. 


Additionally, in a somewhat different context, in 1965 Logan [link] showed 
that a bandlimited signal can be perfectly recovered in the presence of 
arbitrary corruptions on a small interval. Again, the recovery method 
consists of searching for the bandlimited signal that is closest to the 
observed signal in the 4; norm. This can be viewed as further validation of 
the intuition gained from [link] — the @; norm is well-suited to sparse 
errors. 


Historically, the use of £; minimization on large problems finally became 
practical with the explosion of computing power in the late 1970's and early 
1980's. In one of its first applications, it was demonstrated that geophysical 


signals consisting of spike trains could be recovered from only the high- 
frequency components of these signals by exploiting 2; minimization [link], 
[link], [link]. Finally, in the 1990's there was renewed interest in these 
approaches within the signal processing community for the purpose of 
finding sparse approximations to signals and images when represented in 
overcomplete dictionaries or unions of bases [link], [link]. Separately, 2; 
minimization received significant attention in the statistics literature as a 
method for variable selection in linear regression, known as the 

Lasso [link]. 


Thus, there are a variety of reasons to suspect that £; minimization will 
provide an accurate method for sparse signal recovery. More importantly, 
this also provides a computationally tractable approach to the sparse signal 
recovery problem. We now provide an overview of £; minimization in both 
the noise-free and noisy settings from a theoretical perspective. We will 
then further discuss algorithms for performing ¢; minimization later in this 
course. 


Noise-free signal recovery 
This module establishes a simple performance guarantee of L1 
minimization for signal recovery with noise-free measurements. 


We now begin our analysis of 
Equation: 


£=argmin ||z||, subjectto ze Aly). 
z 


for various specific choices of A(y). In order to do so, we require the 
following general result which builds on Lemma 4 from "2; minimization 
proof". The key ideas in this proof follow from [link]. 


with dog < /2—1. Let x, € R% be given, and define h = & — z. Let 
Ao denote the index set corresponding to the K entries of x with largest 
magnitude and A, the index set corresponding to the K entries of hg with 


largest magnitude. Set A = Ag U Aj. If ||Z||, < |||], then 
Equation: 
oK(x) hy, Ph 
Pee es 
VK halls 

where 
Equation: 

1- (1- v2) 60x ; 

Cy =2 C= 


1— (1+-V2) box 1— (14 -V2)éox 


We begin by observing that h = hy + hye, so that from the triangle 
inequality 
Equation: 


All = Palle + [hve 


2° 


We first aim to bound ||hz< 
proof" we have 


>. From Lemma 3 from ""€;_minimization 


Equation: 
Pasll 
haclla =[]D Pall < Dd) [Rall < —, 
2, ; 2, ‘VK 


where the A; are defined as before, i.e., A; is the index set corresponding to 
the K largest entries of h Ag (in absolute value), A as the index set 
corresponding to the next A largest entries, and so on. 


We now wish to bound ||h 4 


_- Since ||x||, > ||Z||,, by applying the 


triangle inequality we obtain 
Equation: 

Ile], > |lez+ All, = |[@4, + haglly + |e 49 =e hae 

2 IZ Ao ll — RAoll + I|Pag 


1 
1 Ewe 


i. 


Rearranging and again applying the triangle inequality, 
Equation: 


[Pag 1 = kale = IZ Ao lh a ai as I|z.A¢ 


< |la — @ As || 4 + Ido lly a5 |@A5 


1 


i. 


Recalling that ox (x), = |l@ae||, = lle — ally 


Equation: 


[hac 


1S WRaglh + 20K(#)1. 


Combining this with [link] we obtain 
Equation: 


Aol + 20K (2), 


VK 


K(z), 


o 
I[haclle S < |[haylly +2 77 


where the last inequality follows from standard bounds on £, norms 
(Lemma 1 from "The RIP and the NSP"). By observing that 

[Ag lle < ||2.a||_ this combines with [link] to yield 

Equation: 


4 bf 
All, < es a 


We now turn to establishing a bound for ||,||,. Combining Lemma 4 from 
"1 minimization proof" with [link] and again applying standard bounds on 
£, norms we obtain 

Equation: 


racll, + plha, Ph)| 


|hallp <a 


VK Palle 
6g el tg, glPha 2 
VK Pall 
x Phy, fh 
< al[hay||, + 2a HE) ale pith ath) 
VE [alls 
since Paglls < |[hall,, 
Equation: 
Oh hy, Ph 
(1—a)||hal|, < 2a K(@)1 gigi hase rn 


VK Ralls 


The assumption that don < /2 — 1ensures that a < 1. Dividing by 
(1 — qa) and combining with [link] results in 
Equation: 


Aa oK(x), 28 |(Ghy, &h)| 
Ia < (So, +2) AR OE ee 


Plugging in for a and £ yields the desired constants. 


[link] establishes an error bound for the class of 2, minimization algorithms 
described by [link] when combined with a measurement matrix @ satisfying 
the RIP. In order to obtain specific bounds for concrete examples of A(y), 
we must examine how requiring  € @(y) affects |(Ghy, &h)|. As an 
example, in the case of noise-free measurements we obtain the following 
theorem. 

(Theorem 1.1 of [link]}) 


Suppose that & satisfies the RIP of order 2K with dox < V2 —1and we 
obtain measurements of the form y = zx. Then when 

By) = {z: &z = y}, the solution Z to [link] obeys 

Equation: 


(t)1 


as Ok 
2 — al], < Co— 


Since x € A(y) we can apply [link] to obtain that for h = % — a, 
Equation: 


Phy, Gh 
Walle < Cp a2 oh A )| 


JK Wall, 


Furthermore, since z, © € AB (y) we also have that y = x = OF and 
hence h — 0. Therefore the second term vanishes, and we obtain the 
desired result. 


[link] is rather remarkable. By considering the case where 

ze Six = {ax: |\x||, <A} we can see that provided @ satisfies the RIP 
— which as shown earlier allows for as few as O(K log (N’/K)) 
measurements — we can recover any K-sparse xexactly. This result seems 
improbable on its own, and so one might expect that the procedure would 
be highly sensitive to noise, but we will see next that [link] can also be used 
to demonstrate that this approach is actually stable. 


Note that [link] assumes that @ satisfies the RIP. One could easily modify 
the argument to replace this with the assumption that # satisfies the null 
noiseless setting, in which case h lies in the null space of ®, then [link] 
simplifies and its proof could be broken into two steps: (2) show that if & 
satisfies the RIP then it satisfies the NSP (as shown in "The RIP and the 
NSP"), and (22) the NSP implies the simplified version of [link]. This proof 
directly mirrors that of [link]. Thus, by the same argument as in the proof of 
[link], it is straightforward to show that if ® satisfies the NSP then it will 
obey the same error bound. 


Signal recovery in noise 

This module establishes a number of results concerning various L1 minimization 
algorithms designed for sparse signal recovery from noisy measurements. The 
results in this module apply to both bounded noise as well as Gaussian (or more 
generally, sub-Gaussian) noise. 


The ability to perfectly reconstruct a sparse signal from noise-free measurements 
represents a promising result. However, in most real-world systems the 
measurements are likely to be contaminated by some form of noise. For instance, in 
order to process data in a computer we must be able to represent it using a finite 
number of bits, and hence the measurements will typically be subject to 
quantization error. Moreover, systems which are implemented in physical hardware 
will be subject to a variety of different types of noise depending on the setting. 


Perhaps somewhat surprisingly, one can show that it is possible to modify 
Equation: 


£=argmin ||z||, subjectto ze Ay). 


to stably recover sparse signals under a variety of common noise models [link], 


extremely useful in establishing performance guarantees in noise. 


In our analysis we will make repeated use of Lemma 1 from "Noise-free signal 
recovery, So we repeat it here for convenience. 


Suppose that & satisfies the RIP of order 2K with dox < /2—1.Letz,@ € RY 
be given, and define h = & — x. Let Ap denote the index set corresponding to the 
K entries of x with largest magnitude and A, the index set corresponding to the K 
entries of hz¢ with largest magnitude. Set A = Ap U Ai. If ||%||, < ||z|],, then 


Equation: 


cla), o, @has Ph)| 


oO 
|All, < Co 
° VK Ralls 


where 
Equation: 


te (1 os V2) 52x ; 


wT (avian ST 4 valine 


Bounded noise 


We first provide a bound on the worst-case performance for uniformly bounded 
noise, as first investigated in [Link]. 
(Theorem 1.2 of [link]) 


Suppose that & satisfies the RIP of order 2K with dox < V2 —1and let 
y = $x + e where |le||, < €. Then when A (y) = {z: ||Gz — yl, < e}, the 
solution Z to [link] obeys 


Equation: 
OK(x£ 
Oe 
K 
where 
Equation: 
= (1 - v2) dex a 
Cy =2 Cy =4 02K 


1— (1+ V2) box 1— (14+ V2) dx 


We are interested in bounding ||h||, = ||z — a||,. Since |le||, < «, z € Ay), and 
therefore we know that ||@||, < ||a||,. Thus we may apply [link], and it remains to 
bound |(@hy, &h)|. To do this, we observe that 

Equation: 


|Phl|, = |P(F — x) ||, = ||$F —y + y — Sal, < || OF — y|| + lly — Pally < 2 


ll, 


where the last inequality follows since x, € & (y). Combining this with the RIP 
and the Cauchy-Schwarz inequality we obtain 
Equation: 


\(Gh4,Bh)| < ||GA4||,||GAll, < 2/1 + dox||Aall,- 
Thus, 
Equation: 


OK(X), 


VK 


OK(r), 


|All, < Co + C,26/1 + bax = Co deOye: 


completing the proof. 


In order to place this result in context, consider how we would recover a sparse 
vector x if we happened to already know the K locations of the nonzero 
coefficients, which we denote by Ap. This is referred to as the oracle estimator. In 
this case a natural approach is to reconstruct the signal using a simple 
pseudoinverse: 

Equation: 


Epp = Py y= (P4,P4,) Pay 


£ 4s = 0. 


The implicit assumption in [link] is that @4, has full column-rank (and hence we are 
considering the case where @4, is the MM x K matrix with the columns indexed by 
A§ removed) so that there is a unique solution to the equation y = ®4,x ,,. With 
this choice, the recovery error is given by 

Equation: 


x -1 -1 
[2 — xl, = |(@4,F4,) P4, (Se +e) — zl], =I(F1,F 4) PA,€ll>: 


We now consider the worst-case bound for this error. Using standard properties of 
the singular value decomposition, it is straightforward to show that if @ satisfies the 


RIP of order 2K (with constant 62), then the largest singular value of \ lies in 


the range E /V1+62ex,1/ ‘7 1— Sax | . Thus, if we consider the worst-case 


recovery error over all e such that ||e||, < e, then the recovery error can be bounded 
by 
Equation: 


€ we € 
< |@ — all, < 


/1 + box Joo 


Therefore, if x is exactly K-sparse, then the guarantee for the pseudoinverse 
recovery method, which is given perfect knowledge of the true support of x, cannot 
improve upon the bound in [link] by more than a constant value. 


We now examine a slightly different noise model. Whereas [link] assumed that the 
noise norm |le||, was small, the theorem below analyzes a different recovery 
algorithm known as the Dantzig selector in the case where ||®* e|| .. is small [link]. 
We will see below that this will lead to a simple analysis of the performance of this 
algorithm in Gaussian noise. 


Suppose that & satisfies the RIP of order 2K with 69x < V2 —1 and we obtain 
measurements of the form y = ®a + e where ||@7 el] ., < . Then when 
By) = {z: ||G7 (Gz — y)||,, < A}, the solution @ to [link] obeys 


loo 


Equation: 
OK(x — 
|Z — z||, < Co K(@), + C3VK), 
where 
Equation: 


oy ot (toy?) _ wi | 
fps (1+ v2) d0x f= (1+ V2) dex 


The proof mirrors that of [link]. Since ||®7el||,, < A, we again have that z € Ay), 


so ||Z||, < ||a||, and thus [link] applies. We follow a similar approach as in [link] to 
bound |(@hy, h)|. We first note that 
Equation: 


||P" hl. < IF (SF—y) |], + IIS (y- Fx) IIo S 2A 


where the last inequality again follows since z, € A(y). Next, note that 
hy = Pyhy. Using this we can apply the Cauchy-Schwarz inequality to obtain 
Equation: 


(Ghz, Ph)| = |(ha, P4Ph)| < |lhallall@4PAllo. 


Finally, since ||@7Gh||_. < 2A, we have that every coefficient of $7 Sh is at most 
2A, and thus ||@7,fh||, < V2K (2A). Thus, 
Equation: 


OK(2), oK(z), 


PII. <C + Cy2V2KA = Co = C3V RK), 


as desired. 


Gaussian noise 


Finally, we also examine the performance of these approaches in the presence of 
Gaussian noise. The case of Gaussian noise was first considered in [link], which 
examined the performance of £9 minimization with noisy measurements. We now 
see that [link] and [link] can be leveraged to provide similar guarantees for £1 
minimization. To simplify our discussion we will restrict our attention to the case 
where x € ix = {x: ||x||) < AK}, so that ox(x), = 0 and the error bounds in 
[link] and [link] depend only on the noise e. 


To begin, suppose that the coefficients of e € R™ are i.i.d. according to a Gaussian 
distribution with mean zero and variance o”. Since the Gaussian distribution is itself 
sub-Gaussian, we can apply results such as Corollary 1 from "Concentration of 
measure for sub-Gaussian random variables" to show that there exists a constant 

co > O such that for any € > 0, 

Equation: 


P(tlell > (1+ <)VMo) <exp (—cpe’M). 


Applying this result to [link] with e = 1, we obtain the following result for the 
special case of Gaussian noise. 


Suppose that & satisfies the RIP of order 2K with dox < V2 — 1. Furthermore, 
suppose that x € 2x and that we obtain measurements of the form y = x + e 
where the entries of e are i.i.d. WY (0, go) Then when 


By) = {2 : ||Sz —yll, < 2/Mo}, the solution Z to [link] obeys 


Equation: 


/1+ dex 


1— (14+ v2) box ae 


|z— zl, <8 


with probability at least 1— exp (—coM). 


We can similarly consider [link] in the context of Gaussian noise. If we assume that 
the columns of & have unit norm, then each coefficient of 7 e is a Gaussian 
random variable with mean zero and variance o”. Using standard tail bounds for the 
Gaussian distribution (see Theorem 1 from '"Sub-Gaussian random variables"), we 
have that 

Equation: 


P(|[®"e],| > ta) <exp (~#°/2) 


for 2 = 1, 2,...,. Thus, using the union bound over the bounds for different 7, we 
obtain 
Equation: 


a 1 
P(IIO7 ella. > 2y/log No) < N exp (—2 log N) = NW: 


Applying this to [link], we obtain the following result, which is a simplified version 
of Theorem 1.1 of [link]. 


Suppose that @ has unit-norm columns and satisfies the RIP of order 2K with 
box < V2 —1. Furthermore, suppose that x € 4’, and that we obtain 
measurements of the form y = @x + e where the entries of e are i.i.d. WY (0, a”). 
Then when 4 (y) = {2 : |B" (Bz — y) ||, < 2/log No}, the solution Z to 


[link] obeys 
Equation: 


1+6 
ie—2,< 40 


1 (143) YEN 


with probability at least 1 — +. 


Ignoring the precise constants and the probabilities with which the bounds hold 
(which we have made no effort to optimize), we observe that if M = O(K log N) 
then these results appear to be essentially the same. However, there is a subtle 
difference. Specifically, if IM and WN are fixed and we consider the effect of varying 
kK, we can see that [link] yields a bound that is adaptive to this change, providing a 
stronger guarantee when K is small, whereas the bound in [link] does not improve 
as K is reduced. Thus, while they provide very similar guarantees, there are certain 
circumstances where the Dantzig selector is preferable. See [link] for further 
discussion of the comparative advantages of these approaches. 


Instance-optimal guarantees revisited 

In this module we demonstrate the difficulty of obtaining instance-optimal guarantees 
in the L2 norm. We then show that it is much easier to obtain such guarantees in the 
probabilistic setting. 


We now briefly return to the noise-free setting to take a closer look at instance- 
optimal guarantees for recovering non-sparse signals. To begin, recall that in 
Theorem 1 from ''Noise-free signal recovery" we bounded the @2-norm of the 
reconstruction error of 

Equation: 


x=argmin ||z||, subjectto ze A(y). 
z 


as 
Equation: 


2 — zl, < Coox(2),/VK 


when A(y) = {z: &z = y}. One can generalize this result to measure the 
reconstruction error using the £,-norm for any p € [1, 2]. For example, by a slight 
modification of these arguments, one can also show that ||@ — «||, < Coox(z), 
(see [link]). This leads us to ask whether we might replace the bound for the £5 error 
with a result of the form ||z — a||, < Cox(x),. Unfortunately, obtaining such a 
result requires an unreasonably large number of measurements, as quantified by the 
following theorem of [link]. 

(Theorem 5.1 of [link]) 


Suppose that is an M x N matrix and that A: R™ — RN is a recovery algorithm 
that satisfies 
Equation: 


jz — A(Pz)||p < Cox(x)» 


forsome K > 1, then M > (1 —,/1— 1/6?) N. 


We begin by letting h € R denote any vector in.W(#). We write h = hy + hae 
where A is an arbitrary set of indices satisfying |A| < K. Set x = hye, and note that 
Ox = Shy: = Gh — Shy = —Ghy since hh € V(#). Since ha € Ux, [link] 


implies that A (x) = A(—Shy) = —hy. Hence, 
||z — A(Sz)||. = ||hac — (—Aa)||2 = ||A||,. Furthermore, we observe that 

oK(«). < |x|], since by definition ox (x), < || — &||, for all & € S’x, including 
= 0. Thus ||h||, < Cl|acl|g. Since |[Al|3 = ||halld + [hac 


Equation: 
1 1 
$< MIB Sell = (1 - Sy) IA 


This must hold for any vector h € oe (®) and for any set of indices A such that 

|A| < K.In parce let {ui}N5 Fi ™ be an orthonormal basis for .”(®), and define 
the vectors {h;}) ;—1 aS follows: 

Equation: 


- this yields 


Palla = [IPll2 — [Peas 


N-M 
v4 (J)v 
i=1 


We note that h; = so (ej, U;)v; Where e; denotes the vector of all zeros except 
for a 1 in the j-th entry. Thus we see that h; = P_ye; where Py denotes an 
orthogonal projection onto .W(@). Since |Pve;|l3 + |Presll, = llesll5 = 1, we 
have that ||h;||, < 1. Thus, by setting A = {7} for h; we observe that 

Equation: 


= [hs Ps (1 Ge JIM <1 - Ge 


Summing over 7 = 1, 2,..., N, we obtain 
Equation: 


N io N-M N N-M 
Ny/1-1/0?> >> GP = De lel = NM 
j=l i=1 w=1 jg=1 i=l 


and thus M > (1 ey 1/C?) N as desired. 


Thus, if we want a bound of the form [link] that holds for all signals x with a constant 
C’' = 1, then regardless of what recovery algorithm we use we will need to take 

M ~= N measurements. However, in a sense this result is overly pessimistic, and we 
will now see that the results we just established for signal recovery in noise can 
actually allow us to overcome this limitation by essentially treating the approximation 
error as noise. 


Towards this end, notice that all the results concerning 4; minimization stated thus far 
are deterministic instance-optimal guarantees that apply simultaneously to all x given 
theoretical property, but as noted in "Matrices that satisfy the RIP", in practice it is 
very difficult to obtain a deterministic guarantee that the matrix @ satisfies the RIP. In 
particular, constructions that rely on randomness are only known to satisfy the RIP 
with high probability. As an example, recall Theorem 1 from "Matrices that satisfy 
the RIP", which opens the door to slightly weaker results that hold only with high 
probability. 


Fix 6 € (0,1). Let be an M x N random matrix whose entries y;; are i.i.d. with 
(y;; drawn according to a strictly sub-Gaussian distribution with e=1/ii 
Equation: 


M > «1K log (z). 


then & satisfies the RIP of order K with the prescribed 6 with probability exceeding 
1 — 2e-™, where x} is arbitrary and Ky = 62/2« — log (42e/6)/kK1. 


Even within the class of probabilistic results, there are two distinct flavors. The 
typical approach is to combine a probabilistic construction of a matrix that will satisfy 
the RIP with high probability with the previous results in this chapter. This yields a 
procedure that, with high probability, will satisfy a deterministic guarantee applying 
to all possible signals x. A weaker kind of result is one that states that given a signal 
x, we can draw a random matrix & and with high probability expect certain 
performance for that signal x. This type of guarantee is sometimes called instance- 
optimal in probability. The distinction is essentially whether or not we need to draw a 
new random @ for each signal x. This may be an important distinction in practice, but 
if we assume for the moment that it is permissible to draw a new matrix @ for each z, 
then we can see that [link] may be somewhat pessimistic. In order to establish our 
main result we will rely on the fact, previously used in "Matrices that satisfy the RIP", 
that sub-Gaussian matrices preserve the norm of an arbitrary vector with high 
probability. Specifically, a slight modification of Corollary 1 from !"Matrices that 


satisfy the RIP" shows that for any x € RY, if we choose & according to the 
procedure in [link], then we also have that 
Equation: 


P( || Sa ||> > 2ll @ 13) <exp (—K3M) 


with K3 = 4/« . Using this we obtain the following result. 


Let z € RN be fixed. Set dox < V2—1 Suppose that ® is an M x N sub-Gaussian 
random matrix with M > «,K log (N/K). Suppose we obtain measurements of the 
form y = @z. Set € = 20x (x),. Then with probability exceeding 

1 — 2 exp (—kK2M)-— exp (—K3M), when B(y) = {z: ||z — y||. < e}, the 
solution % to [link] obeys 

Equation: 


84/1 + dex — (1 4 v2) 50x 
1— (1 + v2) box 


|[Z@— all, < oK(2)». 


First we recall that, as noted above, from [link] we have that @ will satisfy the RIP of 
order 2K with probability at least 1 — 2 exp (—K2M). Next, let A denote the index 
set corresponding to the K entries of x with largest magnitude and write 

L=2A+ 2,4. Since ry, € XK, wecan write $x = 6744+ Ory. = Gt, +e. If @ 
is sub-Gaussian then from Lemma 2 from "Sub-Gaussian random variables" we have 
that z 4- is also sub-Gaussian, and one can apply [link] to obtain that with 
probability at least 1— exp (—K3M), ||@xacl|, < 2||aac||) = 20K («),. Thus, 
applying the union bound we have that with probability exceeding 

1 — 2 exp (—K2M)— exp (—Kk3M), we satisfy the necessary conditions to apply 
Theorem 1 from "Signal recovery in noise" to 4, in which case ox(x,), = 0 and 
hence 

Equation: 


|Z — eally < 2CroK(2)>. 


From the triangle inequality we thus obtain 
Equation: 


|e — all, = ||@ -—aa+ ea —al|p < |@— all, + lea — alla < (2C2 + Iox(x), 


which establishes the theorem. 


Thus, although it is not possible to achieve a deterministic guarantee of the form in 
[link] without taking a prohibitively large number of measurements, it is possible to 
show that such performance guarantees can hold with high probability while 
simultaneously taking far fewer measurements than would be suggested by [link]. 
Note that the above result applies only to the case where the parameter is selected 
correctly, which requires some limited knowledge of x, namely ox (x)4. In practice 
this limitation can easily be overcome through a parameter selection technique such 
as cross-validation [link], but there also exist more intricate analyses of £1 
minimization that show it is possible to obtain similar performance without requiring 
an oracle for parameter selection [link]. Note that [link] can also be generalized to 
handle other measurement matrices and to the case where z is compressible rather 
than sparse. Moreover, this proof technique is applicable to a variety of the greedy 
algorithms described later in this course that do not require knowledge of the noise 
level to establish similar results [link], [link]. 


The cross-polytope and phase transitions 
In this module we provide an overview of the relationship between L1 
minimization and random projections of the cross-polytope. 


(RIP) described in "Signal recovery_in noise" allows us to establish a 
variety of guarantees under different noise settings, but one drawback is that 
the analysis of how many measurements are actually required for a matrix 
to satisfy the RIP is relatively loose. An alternative approach to analyzing 
£, minimization algorithms is to examine them from a more geometric 
perspective. Towards this end, we define the closed £, ball, also known as 
the cross-polytope: 

Equation: 


CN = {2 ER: |x|, <1}. 


Note that C is the convex hull of 2 points {p}en. Let $C% CR” 


denote the convex polytope defined as either the convex hull of {Bp}, 


or equivalently as 
Equation: 


CN = {ye R™ :y=$2,2 EC}. 


For any z € Xx = {z: ||z||) < K}, we can associate a K-face of C% 
with the support and sign pattern of 2. One can show that the number of K- 
faces of $C% is precisely the number of index sets of size K for which 
signals supported on them can be recovered by 

Equation: 


£=argmin ||z||, subjectto z¢€ A(y). 
z 


with A(y) = {z: &z = y}. Thus, 2; minimization yields the same solution 
as £9 minimization for all x € 4’, if and only if the number of K-faces of 
®CN is identical to the number of K-faces of C. Moreover, by counting 


the number of K-faces of 6C, we can quantify exactly what fraction of 
sparse vectors can be recovered using £; minimization with # as our 
sensing matrix. See [link], [link], [link], [link], [link] for more details 
and [link] for an overview of the implications of this body of work. Note 
also that by replacing the cross-polytope with certain other polytopes (the 
simplex and the hypercube), one can apply the same technique to obtain 
results concerning the recovery of more limited signal classes, such as 
sparse signals with nonnegative or bounded entries [link]. 


Given this result, one can then study random matrix constructions from this 
perspective to obtain probabilistic bounds on the number of K-faces of 
BON with G is generated at random, such as from a Gaussian distribution. 
Under the assumption that kK = pM and M = YN, one can obtain 
asymptotic results as NV — oo. This analysis leads to the phase transition 
phenomenon, where for large problem sizes there are sharp thresholds 
dictating that the fraction of K-faces preserved will tend to either one or 
zero with high probability, depending on p and ¥ [Link]. 


These results provide sharp bounds on the minimum number of 
measurements required in the noiseless setting. In general, these bounds are 
significantly stronger than the corresponding measurement bounds obtained 
within the RIP-based framework given in "Noise-free signal recovery", 
which tend to be extremely loose in terms of the constants involved. 
However, these sharper bounds also require somewhat more intricate 
analysis and typically more restrictive assumptions on @ (such as it being 
Gaussian). Thus, one of the main strengths of the RIP-based analysis 
presented in "Noise-free signal recovery” and ''Signal recovery in noise" is 
that it gives results for a broad class of matrices that can also be extended to 
noisy settings. 


Sparse recovery algorithms 
This module introduces some of the tradeoffs involved in the design of 
Sparse recovery algorithms. 


Given noisy compressive measurements y = zx + e of a signal x, a core 
problem in compressive sensing (CS) is to recover a sparse signal x from a 
set of measurements y. Considerable efforts have been directed towards 
developing algorithms that perform fast, accurate, and stable reconstruction 
of x from y. As we have seen in previous chapters, a “good” CS matrix 
typically satisfies certain geometric conditions, such as the restricted 

ways in order to drive down the number of measurements, enable faster 
reconstruction, and ensure robustness to both numerical and stochastic 
errors. 


The design of sparse recovery algorithms are guided by various criteria. 
Some important ones are listed as follows. 


e Minimal number of measurements. Sparse recovery algorithms 
must require approximately the same number of measurements (up to a 
small constant) required for the stable embedding of K-sparse signals. 

e Robustness to measurement noise and model mismatch Sparse 
recovery algorithms must be stable with regards to perturbations of the 
input signal, as well as noise added to the measurements; both types of 
errors arise naturally in practical systems. 

e Speed. Sparse recovery algorithms must strive towards expending 
minimal computational resources, Keeping in mind that a lot of 
applications in CS deal with very high-dimensional signals. 

¢ Performance guarantees. In previous chapters, we have already 
seen a range of performance guarantees that hold for sparse signal 
recovery using ¢; minimization. In evaluating other algorithms, we 
will have the same considerations. For example, we can choose to 
design algorithms that possess instance-optimal or probabilistic 
guarantees. We can also choose to focus on algorithm performance for 
the recovery of exactly K-sparse signals z, or consider performance 
for the recovery of general signals xs. Alternately, we can also 


consider algorithms that are accompanied by performance guarantees 
in either the noise-free or noisy settings. 


A multitude of algorithms satisfying some (or even all) of the above have 
been proposed in the literature. While it is impossible to describe all of 
them in this chapter, we refer the interested reader to the DSP resources 
webpage for a more complete list of recovery algorithms. Broadly speaking, 
recovery methods tend to fall under three categories: convex optimization- 
of the chapter discusses several properties and example algorithms of each 
flavor of CS reconstruction. 


Convex optimization-based methods 
This module provides an overview of convex optimization approaches to 
Sparse signal recovery. 


An important class of sparse recovery_algorithms fall under the purview of 
convex optimization. Algorithms in this category seek to optimize a convex 
function f(-) of the unknown variable x over a (possibly unbounded) 
convex subset of R. 


Setup 


Let J(x) be a convex sparsity-promoting cost function (i.e., J(x) is small 
for sparse x.) To recover a sparse signal representation % from 
measurements y = x, @ € RY we may either solve 

Equation: 


min {J (x): y= zr}, 


when there is no noise, or solve 
Equation: 


min {J (c) : H (#z,y) < ¢} 


when there is noise in the measurements. Here, H is a cost function that 
penalizes the distance between the vectors x and y. For an appropriate 
penalty parameter py, [link] is equivalent to the unconstrained formulation: 
Equation: 


min J (x) + wH (&z, y) 


for some jz > 0. The parameter fz may be chosen by trial-and-error, or by 
Statistical techniques such as cross-validation [link]. 


For convex programming algorithms, the most common choices of J and H 
are usually chosen as follows: J (x) = || x ||,, the 21-norm of z, and 


H ($z,y) = +|| Sz —y | *, the £)-norm of the error between the observed 
measurements and the linear projections of the target vector x. In statistics, 
minimizing this H subject to || x ||, < 6 is known as the Lasso problem. 
More generally, J(-) acts as a regularization term and can be replaced by 
other, more complex, functions; for example, the desired signal may be 
piecewise constant, and simultaneously have a sparse representation under a 
known basis transform W. In this case, we may use a mixed regularization 
term: 

Equation: 


J(z)=TV (x) +A |] @ {I 


It might be tempting to use conventional convex optimization packages for 
the above formulations ({link], [link], and [link]). Nevertheless, the above 
problems pose two key challenges which are specific to practical problems 
encountered in CS: (i) real-world applications are invariably large-scale (an 
image of a resolution of 1024 x 1024 pixels leads to optimization over a 
million variables, well beyond the reach of any standard optimization 
software package); (ii) the objective function is nonsmooth, and standard 
smoothing techniques do not yield very good results. Hence, for these 
problems, conventional algorithms (typically involving matrix 
factorizations) are not effective or even applicable. These unique challenges 
encountered in the context of CS have led to considerable interest in 
developing improved sparse recovery algorithms in the optimization 
community. 


Linear programming 


In the noiseless case, the 2;-minimization problem (obtained by substituting 
J (x) = || a ||, in [link]) can be recast as a linear program (LP) with 
equality constraints. These can be solved in polynomial time (O (NV 3)) 


using standard interior-point methods [link]. This was the first feasible 
reconstruction algorithm used for CS recovery and has strong theoretical 


guarantees, as shown earlier in this course. In the noisy case, the problem 
can be recast as a second-order cone program (SOCP) with quadratic 
constraints. Solving LPs and SOCPs is a principal thrust in optimization 
research; nevertheless, their application in practical CS problems is limited 
due to the fact that both the signal dimension N, and the number of 
constraints M, can be very large in many scenarios. Note that both LPs and 
SOCPs correspond to the constrained formulations in [link] and [link] and 
are solved using first order interior-point methods. 


A newer algorithm called “11_Is" [link] is based on an interior-point 
algorithm that uses a preconditioned conjugate gradient (PCG) method to 
approximately solve linear systems in a truncated-Newton framework. The 
algorithm exploits the structure of the Hessian to construct their 
preconditioner; thus, this is a second order method. Computational results 
show that about a hundred PCG steps are sufficient for obtaining accurate 
reconstruction. This method has been typically shown to be slower than 
first-order methods, but could be faster in cases where the true target signal 
is highly sparse. 


Fixed-point continuation 


As opposed to solving the constrained formulation, an alternate approach is 
to solve the unconstrained formulation in [link]. A widely used method for 
solving £,-minimization problems of the form 

Equation: 


min pl ||, +H (2), 


for a convex and differentiable H, is an iterative procedure based on 
shrinkage (also called soft thresholding; see [link] below). In the context of 
solving [link] with a quadratic H, this method was independently proposed 
and analyzed in [link], [link], [link], [link], and then further studied or 
extended in [link], [link], [link], [link], [link], [link]. Shrinkage is a classic 
method used in wavelet-based image denoising. The shrinkage operator on 
any scalar component can be defined as follows: 

Equation: 


t—a ift>a, 
shrink (t,a)= 0 if—a<t<a,and 
t+ta ift<-—a. 


This concept can be used effectively to solve [link]. In particular, the basic 
algorithm can be written as following the fixed-point iteration: for 

i =1,...,.N, the i*" coefficient of x at the (k + 1) time step is given by 
Equation: 


aghtt = shrink ((x* —TVH (z")) 5 UT) 


where T > 0 serves as a step-length for gradient descent (which may vary 
with k) and p is as specified by the user. It is easy to see that the larger ju is, 
the larger the allowable distance between x**1 and x*. For a quadratic 
penalty term H(-), the gradient \7H can be easily computed as a linear 
function of «*; thus each iteration of [link] essentially boils down to a small 
number of matrix-vector multiplications. 


The simplicity of the iterative approach is quite appealing, both from a 
computational, as well as a code-design standpoint. Various modifications, 
enhancements, and generalizations to this approach have been proposed, 
both to improve the efficiency of the basic iteration in [link], and to extend 
its applicability to various kinds of J [link], [link], [link]. In principle, the 
basic iteration in [link] would not be practically effective without a 
continuation (or path-following) strategy [link], [link] in which we choose a 
gradually decreasing sequence of values for the parameter pu to guide the 
intermediate iterates towards the final optimal solution. 


This procedure is known as continuation; in [link], the performance of an 
algorithm known as Fixed-Point Continuation (FPC) has been compared 
favorably with another similar method known as Gradient Projection for 
Sparse Reconstruction (GPSR) [link] and “l1_lIs” [link]. A key aspect to 
solving the unconstrained optimization problem is the choice of the 
parameter ps. As discussed above, for CS recovery, s may be chosen by trial 


and error; for the noiseless constrained formulation, we may solve the 
corresponding unconstrained minimization by choosing a large value for pz. 


In the case of recovery from noisy compressive measurements, a commonly 
used choice for the convex cost function H(z) is the square of the norm of 
the residual. Thus we have: 

Equation: 


H(z) =||y—-%z ||; 
VH(x) =26' (y—@z). 


For this particular choice of penalty function, [link] reduces to the 
following iteration: 
Equation: 


o}*! = shrink((2* — ry H(y— 2°), wr) 


which is run until convergence to a fixed point. The algorithm is detailed in 
pseudocode form below. 


Inputs: CS matrix @, signal measurements y, 
parameter sequence [, 
Outputs: Signal estimate x 
initialize: %=0, r=y, k=0. 
while halting criterion false do 
1. k¢ k+1 
2. £«+ £—7Th'r {take a gradient step} 


3. << shrink (x, p,7) {perform soft 
thresholding} 
4. r<+~y—@z {update measurement residual} 


end while 
return 2<2@ = 


Bregman iteration methods 


It turns out that an efficient method to obtain the solution to the constrained 
optimization problem in [link] can be devised by solving a small number of 
the unconstrained problems in the form of [link]. These subproblems are 
commonly referred to as Bregman iterations. A simple version can be 
written as follows: 

Equation: 


yt! =yb+y— oa! 
a*t! —argmin J (x) + 5! bx — y*t I". 


The problem in the second step can be solved by the algorithms reviewed 
above. Bregman iterations were introduced in [link] for constrained total 
variation minimization problems, and was proved to converge for closed, 
convex functions J(z). In [link], it is applied to [link] for J (x) = || z ||, 
and shown to converge in a finite number of steps for any yz > 0. For 
moderate ps, the number of iterations needed is typically lesser than 5. 
Compared to the alternate approach that solves [link] through directly 
solving the unconstrained problem in [link] with a very large 4, Bregman 
iterations are often more stable and sometimes much faster. 


Discussion 


All the methods discussed in this section optimize a convex function 
(usually the £;-norm) over a convex (possibly unbounded) set. This implies 
guaranteed convergence to the global optimum. In other words, given that 
the sampling matrix @ satisfies the conditions specified in "Signal recovery 
via 4; minimization", convex optimization methods will recover the 
underlying signal z. In addition, convex relaxation methods also guarantee 
stable recovery by reformulating the recovery problem as the SOCP, or the 
unconstrained formulation. 


Greedy algorithms 
In this module we provide an overview of some of the most common 
greedy algorithms and their application to the problem of sparse recovery. 


Setup 


As opposed to solving a (possibly computationally expensive) convex 
optimization program, an alternate flavor to sparse recovery is to apply 
methods of sparse approximation. Recall that the goal of sparse recovery is 
to recover the sparsest vector x which explains the linear measurements y. 
In other words, we aim to solve the (nonconvex) problem: 

Equation: 


min \FZi:y= > gia 
a 1Ef 


where -¥ denotes a particular subset of the indices 7 = 1,..., N, and y; 
denotes the i*® column of &. It is well known that searching over the power 
set formed by the columns of & for the optimal subset 4° with smallest 
cardinality is NP-hard. Instead, classical sparse approximation methods 
tackle this problem by greedily selecting columns of ¢ and forming 
successively better approximations to y. 


Matching Pursuit 


Matching Pursuit (MP), named and introduced to the signal processing 
community by Mallat and Zhang [link], [link], is an iterative greedy 
algorithm that decomposes a signal into a linear combination of elements 
from a dictionary. In sparse recovery, this dictionary is merely the sampling 
matrix @ € R™™* we seek a sparse representation (x) of our “signal” y. 


MP is conceptually very simple. A key quantity in MP is the residual 

r € R™; the residual represents the as-yet “unexplained” portion of the 
measurements. At each iteration of the algorithm, we select a vector from 
the dictionary that is maximally correlated with the residual r: 


Equation: 


k =argmax ———— — 
A | ea ll 


Once this column is selected, we possess a “better” representation of the 
signal, since a new coefficient indexed by A; has been added to our signal 
approximation. Thus, we update both the residual and the approximation as 
follows: 

Equation: 


(Tea) Dru) Prx 
Il pa, II? 
©), — EY, ag (Tr-1, Drx)« 


Ve TRL 


and repeat the iteration. A suitable stopping criterion is when the norm of r 
becomes smaller than some quantity. MP is described in pseudocode form 
below. 


Inputs: Measurement matrix ®, signal measurements 


Yy 
Outputs: Sparse signal & 
initialize: %=0, r=y, i=0. 
while halting criterion false do 
1.z2¢i+1 
2. b+ Gr {form residual signal estimate} 
3. £;< 21+ T(1) {update largest magnitude 
coefficient} 


4. r+ r—@z,; {update measurement residual} 
end while 


return £¢ @; : 


Although MP is intuitive and can find an accurate approximation of the 
signal, it possesses two major drawbacks: (i) it offers no guarantees in terms 


of recovery error; indeed, it does not exploit the special structure present in 
the dictionary ©; (ii) the required number of iterations required can be quite 
large. The complexity of MP is O( MNT) [link] , where T' is the number of 
MP iterations 


Orthogonal Matching Pursuit (OMP) 


Matching Pursuit (MP) can prove to be computationally infeasible for many 
problems, since the complexity of MP grows linearly in the number of 
iterations T’. By employing a simple modification of MP, the maximum 
number of MP iterations can be upper bounded as follows. At any iteration 
k, Instead of subtracting the contribution of the dictionary element with 
which the residual r is maximally correlated, we compute the projection of 
r onto the orthogonal subspace to the linear span of the currently selected 
dictionary elements. This quantity thus better represents the “unexplained” 
portion of the residual, and is subtracted from r to form a new residual, and 
the process is repeated. If @- is the submatrix formed by the columns of & 
selected at time step t¢, the following operations are performed: 

Equation: 


x, =argmin || y— Pox |l2, 


a, = Ppxt, 


Te =Y— Oot. 


These steps are repeated until convergence. This is known as Orthogonal 
Matching Pursuit (OMP) [link]. Tropp and Gilbert [link] proved that OMP 
can be used to recover a sparse signal with high probability using 
compressive measurements. The algorithm converges in at most 
iterations, where K is the sparsity, but requires the added computational cost 
of orthogonalization at each iteration. Indeed, the total complexity of OMP 
can be shown to be O(MNK). 


While OMP is provably fast and can be shown to lead to exact recovery, the 
guarantees accompanying OMP for sparse recovery are weaker than those 
associated with optimization techniques. In particular, the reconstruction 


guarantees are not uniform, i.e., it cannot be shown that a single 
measurement matrix with M = C'K log N rows can be used to recover 
every possible K —sparse signal with MZ = CK log N measurements. 
(Although it is possible to obtain such uniform guarantees when it is 
acceptable to take more measurements. For example, see [link].) Another 
issue with OMP is robustness to noise; it is unknown whether the solution 
obtained by OMP will only be perturbed slightly by the addition of a small 
amount of noise in the measurements. Nevertheless, OMP is an efficient 
method for CS recovery, especially when the signal sparsity K is low. A 
pseudocode representation of OMP is shown below. 


Inputs: Measurement matrix @, signal measurements 


y 
Outputs: Sparse representation 7 


Initialize: 0) =0, r=y, 2=9, i=0. 
while halting criterion false do 

1.%a¢i+l1 

2. be @' {form residual signal estimate} 

3. 2< QNUsupp(T(b,1)) {add index of 
residual's largest magnitude entry to signal 
support} 

A, B;|_,— Ppa, Zil oc — 0 {form signal 
estimate} 

5. r+ y—@2x; {update measurement residual} 
end while 
return &<¢ &; : 


Stagewise Orthogonal Matching Pursuit (StOMP) 


Orthogonal Matching Pursuit is ineffective when the signal is not very 
Sparse as the computational cost increases quadratically with the number of 
nonzeros K. In this setting, Stagewise Orthogonal Matching Pursuit 
(StOMP) [link] is a better choice for approximately sparse signals in a 
large-scale setting. 


StOMP offers considerable computational advantages over ¢; minimization 
and Orthogonal Matching Pursuit for large scale problems with sparse 
solutions. The algorithm starts with an initial residual rg = y and calculates 
the set of all projections 7 r;,_; at the k”” stage (as in OMP). However, 
instead of picking a single dictionary element, it uses a threshold parameter 
T to determine the next best set of columns of whose correlations with the 
current residual exceed 7. The new residual is calculated using a least 
Squares estimate of the signal using this expanded set of columns, just as 
before. 


Unlike OMP, the number of iterations in StOMP is fixed and chosen before 
hand; S = 10 is recommended in [link]. In general, the complexity of 
StOMP is O( KN log N) , a significant improvement over OMP. However, 
StOMP does not bring in its wake any reconstruction guarantees. StOMP 
also has moderate memory requirements compared to OMP where the 
orthogonalization requires the maintenance of a Cholesky factorization of 
the dictionary elements. 


Compressive Sampling Matching Pursuit (CoSaMP) 


Greedy pursuit algorithms (such as MP and OMP) alleviate the issue of 
computational complexity encountered in optimization-based sparse 
recovery, but lose the associated strong guarantees for uniform signal 
recovery, given a requisite number of measurements of the signal. In 
addition, it is unknown whether these greedy algorithms are robust to signal 
and/or measurement noise. 


There have been some recent attempts to develop greedy algorithms 
(Regularized OMP [link], [link], Compressive Sampling Matching Pursuit 
(CoSaMP) [link] and Subspace Pursuit [link]) that bridge this gap between 
(RIP), developed in the context of analyzing £, minimization, plays a 
central role in such algorithms. Indeed, if the matrix @ satisfies the RIP of 
order K, this implies that every subset of K columns of the matrix is 
approximately orthonormal. This property is used to prove strong 
convergence results of these greedy-like methods. 


One variant of such an approach is employed by the CoSaMP algorithm. An 
interesting feature of CoSaMP is that unlike MP, OMP and StOMP, new 
indices in a signal estimate can be added as well as deleted from the current 
set of chosen indices. In contrast, greedy pursuit algorithms suffer from the 
fact that a chosen index (or equivalently, a chosen atom from the dictionary 
@ remains in the signal representation until the end. A brief description of 
CoSaMP is as follows: at the start of a given iteration 2, suppose the signal 
estimate is %;_1. 


Form signal residual estimate: e < $7 r 

Find the biggest 2K coefficients of the signal residual e; call this set of 
indices 92. 

Merge supports: T < (2 U supp (Z;-1) 


Form signal estimate b by subspace projection: b|,, — bly, bl rc + O 
Prune 6 by retaining its K largest coefficients. Call this new estimate 


Xj. 
Update measurement residual: r <— y — $2;. 


This procedure is summarized in pseudocode form below. 


Inputs: Measurement matrix ®@, measurements y, 
Signal sparsity Kk 

Output: K-sparse approximation Z to true signal 
representation x 

Initialize: 7 =0, r=y; i=0 

while halting criterion false do 


1.2¢274+1 
2. e¢ G'r {form signal residual estimate} 
3. Q< supp(T(e,2K)) {prune signal residual 


estimate} 


4, T < QUsupp(%;_1) {merge supports} 

5. bp <—Glhy, Blpc {form signal estimate} 

6. 2; << T(b,K) {prune signal estimate} 

7. r<~y—@#2z; {update measurement residual} 


and whitlta 


CiluUu WIILLU 


return £+¢ &; : 


As discussed in [link], the key computational issues for CoSaMP are the 
formation of the signal residual, and the method used for subspace 
projection in the signal estimation step. Under certain general assumptions, 
the computational cost of CoSaMP can be shown to be O(MN), which is 
independent of the sparsity of the original signal. This represents an 
improvement over both greedy algorithms as well as convex methods. 


While CoSaMP arguably represents the state of the art in sparse recovery 
algorithm performance, it possesses one drawback: the algorithm requires 
prior knowledge of the sparsity K of the target signal. An incorrect choice 
of input sparsity may lead to a worse guarantee than the actual error 
incurred by a weaker algorithm such as OMP. The stability bounds 
accompanying CoSaMP ensure that the error due to an incorrect parameter 
choice is bounded, but it is not yet known how these bounds translate into 
practice. 


Iterative Hard Thresholding 


Iterative Hard Thresholding (IHT) is a well-known algorithm for solving 
nonlinear inverse problems. The structure of IHT is simple: starting with an 
initial estimate Zo, iterative hard thresholding (IHT) obtains a sequence of 
estimates using the iteration: 

Equation: 


ST (ee + oF (y — $Z;), K). 


In [link], Blumensath and Davies proved that this sequence of iterations 
converges to a fixed point 7; further, if the matrix @ possesses the RIP, they 
showed that the recovered signal % satisfies an instance-optimality 
guarantee of the type described earlier. The guarantees (as well as the proof 
technique) are reminiscent of the ones that are derived in the development 
of other algorithms such as ROMP and CoSaMP. 


Discussion 


While convex optimization techniques are powerful methods for computing 
Sparse representations, there are also a variety of greedy/iterative methods 
for solving such problems. Greedy algorithms rely on iterative 
approximation of the signal coefficients and support, either by iteratively 
identifying the support of the signal until a convergence criterion is met, or 
alternatively by obtaining an improved estimate of the sparse signal at each 
iteration by accounting for the mismatch to the measured data. Some greedy 
methods can actually be shown to have performance guarantees that match 
those obtained for convex optimization approaches. In fact, some of the 
more sophisticated greedy algorithms are remarkably similar to those used 
for 2; minimization described previously. However, the techniques required 
to prove performance guarantees are substantially different. There also exist 
iterative techniques for sparse recovery based on message passing schemes 
for sparse graphical models. In fact, some greedy algorithms (such as those 
in [link], [link]) can be directly interpreted as message passing 

methods [link]. 


Combinatorial algorithms 
This module introduces the count-min and count-median sketches as 
representative examples of combinatorial algorithms for sparse recovery. 


In addition to convex optimization and greedy_pursuit approaches, there is 
another important class of sparse recovery algorithms that we will refer to 
as combinatorial algorithms. These algorithms, mostly developed by the 
theoretical computer science community, in many cases pre-date the 
compressive sensing literature but are highly relevant to the sparse signal 
recovery_problem. 


Setup 


The oldest combinatorial algorithms were developed in the context of group 
testing [link], [link], [link]. In the group testing problem, we suppose that 
there are NN total items, of which an unknown subset of K elements are 
anomalous and need to be identified. For example, we might wish to 
identify defective products in an industrial setting, or identify a subset of 
diseased tissue samples in a medical context. In both of these cases the 
vector x indicates which elements are anomalous, i.e., x; ~ 0 for the KX 
anomalous elements and z; = 0 otherwise. Our goal is to design a 
collection of tests that allow us to identify the support (and possibly the 
values of the nonzeros) of xz while also minimizing the number of tests 
performed. In the simplest practical setting these tests are represented by a 
binary matrix & whose entries ;; are equal to 1 if and only if the tb item is 
used in the i*" test. If the output of the test is linear with respect to the 
inputs, then the problem of recovering the vector z is essentially the same 
as the standard sparse recovery problem. 


Another application area in which combinatorial algorithms have proven 
useful is computation on data streams [link], [link]. Suppose that x; 
represents the number of packets passing through a network router with 
destination 2. Simply storing the vector z is typically infeasible since the 
total number of possible destinations (represented by a 32-bit IP address) is 
N = 2°. Thus, instead of attempting to store x directly, one can store 

y = &x where @ is an M x N matrix with M < N. In this context the 
vector y is often called a sketch. Note that in this problem y is computed in 


a different manner than in the compressive sensing context. Specifically, in 
the network traffic example we do not ever observe x; directly; rather, we 
observe increments to z; (when a packet with destination 2 passes through 
the router). Thus we construct y iteratively by adding the z*” column to y 
each time we observe an increment to x;, which we can do since y = @z is 
linear. When the network traffic is dominated by traffic to a small number 
of destinations, the vector x is compressible, and thus the problem of 
recovering x from the sketch @z is again essentially the same as the sparse 
recovery problem. 


Several combinatorial algorithms for sparse recovery have been developed 
in the literature. A non-exhaustive list includes Random Fourier 

Sampling [link], HHS Pursuit [link], and Sparse Sequential Matching 
Pursuit [link]. We do not provide a full discussion of each of these 
algorithms; instead, we describe two simple methods that highlight the 
flavors of combinatorial sparse recovery — count-min and count-median. 


The count-min sketch 


Define H as the set of all discrete-valued functions 

h: {1,...,N}— {1,...,m}. Note that H is a finite set of size m. Each 
function h € H can be specified by a binary characteristic matrix p(h) of 
size m x NN, with each column being a binary vector with exactly one 1 at 
the location 7 , where 7 = h(i). To construct the overall sampling matrix ®, 
we choose d functions hj, ..., hg independently from the uniform 
distribution defined on H, and vertically concatenate their characteristic 
matrices. Thus, if M = md, € is a binary matrix of size M x N with each 
column containing exactly d ones. 


Now given any signal x, we acquire linear measurements y = @z. It is easy 
to visualize the measurements via the following two properties. First, the 
coefficients of the measurement vector y are naturally grouped according to 
the “mother” binary functions {hj, ..., Aa}. Second, consider the ae 
coefficient of the measurement vector y, which corresponds to the mother 
binary function h. Then, the expression for y; is simply given by: 
Equation: 


In other words, for a fixed signal coefficient index 7, each measurement y; 
as expressed above consists of an observation of x; corrupted by other 
signal coefficients mapped to the same 2 by the function h. Signal recovery 
essentially consists of estimating the signal values from these “corrupted" 
observations. 


The count-min algorithm is useful in the special case where the entries of 
the original signal are positive. Given measurements y using the sampling 
matrix ® as constructed above, the estimate of the j‘" signal entry is given 
by: 

Equation: 


Ly =min yp hpg) =. 


Intuitively, this means that the estimate of x; is formed by simply looking at 
all measurements that comprise of x; corrupted by other signal values, and 
picking the one with the lowest magnitude. Despite the simplicity of this 
algorithm, it is accompanied by an arguably powerful instance-optimality 
guarantee: if d = C log N and m = 4/aK, then with high probability, the 
recovered signal Z satisfies: 

Equation: 


as * 
| t—-Z ||o<a/K-||e#—z hj, 


where x represents the best K-term approximation of z in the 2, sense. 


The count-median sketch 


For the general setting when the coefficients of the original signal could be 
either positive or negative, a similar algorithm known as count-median can 
be used. Instead of picking the minimum of the measurements, we compute 


the median of all those measurements that are comprised of a corrupted 
version of x; and declare it as the signal coefficient estimate, i.e., 
Equation: 


a =median grt hyg) =. 


The recovery guarantees for count-median are similar to that for count-min, 
with a different value of the failure probability constant. An important 
feature of both count-min and count-median is that they require that the 
measurements be perfectly noiseless, in contrast to optimization/greedy 
algorithms which can tolerate small amounts of measurement noise. 


Summary 


Although we ultimately wish to recover a sparse signal from a small 
number of linear measurements in both of these settings, there are some 
important differences between such settings and the compressive sensing 
setting studied in this course. First, in these settings it is natural to assume 
that the designer of the reconstruction algorithm also has full control over @ 
, and is thus free to choose @ in a manner that reduces the amount of 
computation required to perform recovery. For example, it is often useful to 
design @ so that it has few nonzeros, i.e., the sensing matrix itself is also 
sparse [link], [link], [link]. In general, most methods involve careful 
construction of the sensing matrix ®, which is in contrast with the 
optimization and greedy methods that work with any matrix satisfying a 


degree of freedom can lead to significantly faster algorithms [link], [link], 
[link], [Link]. 


Second, note that the computational complexity of all the convex methods 
and greedy algorithms described above is always at least linear in NV, since 
in order to recover x we must at least incur the computational cost of 
reading out all NV entries of z. This may be acceptable in many typical 
compressive sensing applications, but this becomes impractical when JN is 
extremely large, as in the network monitoring example. In this context, one 
may seek to develop algorithms whose complexity is linear only in the 


length of the representation of the signal, i.e., its sparsity A’. In this case the 
algorithm does not return a complete reconstruction of x but instead returns 
only its kK largest elements (and their indices). As surprising as it may 
seem, such algorithms are indeed possible. See [link], [link] for examples. 


Bayesian methods 
This module provides an overview of the application of Bayesian methods 
to compressive sensing and sparse recovery. 


Setup 


Throughout this course, we have almost exclusively worked within a 
deterministic signal framework. In other words, our signal z is fixed and 
belongs to a known set of signals. In this section, we depart from this 
framework and assume that the sparse (or compressible) signal of interest 
arises from a known probability distribution, i.e., we assume sparsity 
promoting priors on the elements of z, and recover from the stochastic 
measurements y = @z a probability distribution on each nonzero element 
of x. Such an approach falls under the purview of Bayesian methods for 
sparse recovery. 


The algorithms discussed in this section demonstrate a digression from the 
conventional sparse recovery techniques typically used in compressive 
sensing (CS). We note that none of these algorithms are accompanied by 
guarantees on the number of measurements required, or the fidelity of 
signal reconstruction; indeed, in a Bayesian signal modeling framework, 
there is no well-defined notion of “reconstruction error”. However, such 
methods do provide insight into developing recovery algorithms for rich 
classes of signals, and may be of considerable practical interest. 


Sparse recovery via belief propagation 


As we will see later in this course, there are significant parallels to be drawn 
between error correcting codes and sparse recovery [link]. In particular, 
sparse codes such as LDPC codes have had grand success. The advantage 
that sparse coding matrices may have in efficient encoding of signals and 
their low complexity decoding algorithms, is transferable to CS encoding 
and decoding with the use of sparse sensing matrices ®. The sparsity in the 
@ matrix is equivalent to the sparsity in LDPC coding graphs. 
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A sensing matrix & that defines the relation between the signal z and 
measurements y can be represented as a bipartite graph of signal coefficient 
nodes x(z) and measurement nodes y(2) [link], [link]. The factor graph in 
[link] represents the relationship between the signal coefficients and 
measurements in the CS decoding problem. 


The choice of signal probability density is of practical interest. In many 
applications, the signals of interest need to be modeled as being 
compressible (as opposed to being strictly sparse). This behavior is modeled 
by a two-state Gaussian mixture distribution, with each signal coefficient 
taking either a “large” or “small” coefficient value state. Assuming that the 
elements of x are i.i.d., it can be shown that small coefficients occur more 
frequently than the large coefficients. Other distributions besides the two- 
state Gaussian may also be used to model the coefficients, for e.g., the i.i.d. 
Laplace prior on the coefficients of x. 


The ultimate goal is to estimate (i.e., decode) x, given y and &. The 
decoding problem takes the form of a Bayesian inference problem in which 


we want to approximate the marginal distributions of each of the (2) 
coefficients conditioned on the observed measurements y(z). We can then 
estimate the Maximum Likelihood Estimate (MLE), or the Maximum a 
Posteriori (MAP) estimates of the coefficients from their distributions. This 
sort of inference can be solved using a variety of methods; for example, the 
popular belief propagation method (BP) [link] can be applied to solve for 
the coefficients approximately. Although exact inference in arbitrary 
graphical models is an NP hard problem, inference using BP can be 
employed when @ is sparse enough, i.e., when most of the entries in the 
matrix are equal to zero. 


Sparse Bayesian learning 


Another probabilistic approach used to estimate the components of z is by 
using Relevance Vector Machines (RVMs). An RVM is essentially a 
Bayesian learning method that produces sparse classification by linearly 
weighting a small number of fixed basis functions from a large dictionary of 
potential candidates (for more details the interested reader may refer 

to [link], [link]). From the CS perspective, we may view this as a method to 
determine the elements of a sparse x which linearly weight the basis 
functions comprising the columns of @. 


The RVM setup employs a hierarchy of priors; first, a Gaussian prior is 
assigned to each of the NV elements of x; subsequently, a Gamma prior is 
assigned to the inverse-variance a; of the i** Gaussian prior. Therefore 
each a; controls the strength of the prior on its associated weight in z;. If x 
is the sparse vector to be reconstructed, its associated Gaussian prior is 
given by: 

Equation: 


N 
p(a\a) = I] 2; 0,a;' 


and the Gamma prior on @ is written as: 
Equation: 


N 
p (ala, b) = Il I (a;\a, b) 


i=1 


The overall prior on x can be analytically evaluated to be the Student-t 
distribution, which can be designed to peak at x; = 0 with appropriate 
choice of a and b. This enables the desired solution x to be sparse. The 
RVM approach can be visualized using a graphical model similar to the one 
in "Sparse recovery via belief propagation". Using the observed 
measurements y, the posterior density on each z; is estimated by an 
iterative algorithm (e.g., Markov Chain Monte Carlo (MCMC) methods). 
For a detailed analysis of the RVM with a measurement noise prior, refer 

to [link], [Link]. 


Alternatively, we can eliminate the need to set the hyperparameters a and 6 
as follows. Assuming Gaussian measurement noise with mean 0 and 
variance a”, we can directly find the marginal log likelihood for a and 
maximize it by the EM algorithm (or directly differentiate) to find estimates 
for a. 

Equation: 


(a) =log p ya,o* =log Je y 2,0° p(yla)dz. 


Bayesian compressive sensing 


Unfortunately, evaluation of the log-likelihood in the original RVM setup 
involves taking the inverse of an NV x N matrix, rendering the algorithm's 
complexity to be O N® . A fast alternative algorithm for the RVM is 
available which monotonically maximizes the marginal likelihoods of the 
priors by a gradient ascent, resulting in an algorithm with complexity 

O NM? . Here, basis functions are sequentially added and deleted, thus 
building the model up constructively, and the true sparsity of the signal z is 
exploited to minimize model complexity. This is known as Fast Marginal 
Likelihood Maximization, and is employed by the Bayesian Compressive 


Sensing (BCS) algorithm [link] to efficiently evaluate the posterior 
densities of x;. 


A key advantage of the BCS algorithm is that it enables evaluation of “error 
bars” on each estimated coefficient of x; these give us an idea of the 
(in)accuracies of these estimates. These error bars could be used to 
adaptively select the linear projections (i.e., the rows of the matrix ®) to 
reduce uncertainty in the signal. This provides an intriguing connection 
between CS and machine learning techniques such as experimental design 
and active learning [link], [link]. 


Linear regression and model selection 

This module provides a brief overview of the relationship between model 
selection, sparse linear regression, and the techniques developed in 
compressive sensing. 


Many of the sparse recovery algorithms we have described so far in this 
course were originally developed to address the problem of sparse linear 
regression and model selection in statistics. In this setting we are given 
some data consisting of a set of input variables and response variables. We 
will suppose that there are a total of NV input variables, and we observe a 
total of M input and response pairs. We can represent the set of input 
variable observations as an M x N matrix @, and the set of response 
variable observations as an M x 1 vector y. 


In linear regression, it is assumed that y can be approximated as a linear 
function of the input variables, i.e., there exists an x such that y + @z. 
However, when the number of input variables is large compared to the 
number of observations, i.e., MJ < N, this becomes extremely challenging 
because we wish to estimate NV parameters from far fewer than NV 
observations. In general this would be impossible to overcome, but in 
practice it is common that only a few input variables are actually necessary 
to predict the response variable. In this case the z that we wish to estimate 
is sparse, and we can apply all of the techniques that we have learned so far 
for sparse recovery to estimate x. In this setting, not only does sparsity aid 
us in our goal of obtaining a regression, but it also performs model selection 
by identifying the most relevant variables in predicting the response. 


Sparse error correction 

This module illustrates the application of the ideas of compressive sensing 
to the design and decoding of error correcting codes for vectors of real 
numbers subject to sparse corruptions. 


In communications, error correction refers to mechanisms that can detect 
and correct errors in the data that appear duet to distortion in the 
transmission channel. Standard approaches for error correction rely on 
repetition schemes, redundancy checks, or nearest neighbor code search. 
We consider the particular case in which a signal x with MM entries is coded 
by taking length-NV linearly independent codewords {1,.... par}, with 

N > M and summing them using the entries of x as coefficients. The 
received message is a length-NV code y = pe y;x; = Pf, where Pisa 
matrix that has the different codewords for columns. We assume that the 
transmission channel corrupts the entries of y in an additive way, so that the 
received data is y= x + e, where e is an error vector. 


The techniques developed for sparse recovery. in the context of compressive 
sensing (CS) provide a number of methods to estimate the error vector e — 
therefore making it possible to correct it and obtain the signal « — when e 
is sufficiently sparse [link]. To estimate the error, we build a matrix O that 
is a basis for the orthogonal subspace to the span of the matrix ®, i.e., an 
(N — M) x N matrix O that holds OS = 0. When such a matrix is 
obtained, we can modify the measurements by multiplying them with the 
matrix to obtain y = Oy = Ox + Oe = Oe. If the matrix O is well- 
suited for CS (i.e., it satisfies a condition such as the restricted isometry 
property) and e is sufficiently sparse, then the error vector e can be 
estimated accurately using CS. Once the estimate é is obtained, the error- 
free measurements can be estimated as y = y — é, and the signal can be 
recovered as = big = Sly — Hié. As an example, when the codewords 
(Ym have random independent and identically distributed sub-Gaussian 
entries, then a K-sparse error can be corrected if M < N—CK log N/K 
for a fixed constant C’ (see '"Matrices that satisfy the RIP"). 


Group testing and data stream algorithms 

This module provides an overview of the relationship between compressive 
sensing and problems in theoretical computer science including 
combinatorial group testing and computation on data streams. 


Another scenario where compressive sensing and sparse recovery. 
algorithms can be potentially useful is the context of group testing and the 
related problem of computation on data streams. 


Group testing 


Among the historically oldest of all sparse recovery algorithms were 
developed in the context of combinatorial group testing [link], [link], [link]. 
In this problem we suppose that there are NV total items and K anomalous 
elements that we wish to find. For example, we might wish to identify 
defective products in an industrial setting, or identify a subset of diseased 
tissue samples in a medical context. In both of these cases the vector x 
indicates which elements are anomalous, i.e., x; ~ 0 for the K anomalous 
elements and 2; = 0 otherwise. Our goal is to design a collection of tests 
that allow us to identify the support (and possibly the values of the 
nonzeros) of x while also minimizing the number of tests performed. In the 
simplest practical setting these tests are represented by a binary matrix & 
whose entries ;; are equal to 1 if and only if the * item is used in the i*® 
test. If the output of the test is linear with respect to the inputs, then the 
problem of recovering the vector z is essentially the same as the standard 
Sparse recovery problem in compressive sensing. 


Computation on data streams 


Another application area in which ideas related to compressive sensing 
have proven useful is computation on data streams [link], [link]. As an 
example of a typical data streaming problem, suppose that x; represents the 
number of packets passing through a network router with destination 7. 
Simply storing the vector z is typically infeasible since the total number of 
possible destinations (represented by a 32-bit IP address) is N = 2°”. Thus, 
instead of attempting to store x directly, one can store y = x where @ is 


an M x N matrix with M < N. In this context the vector y is often called 
a sketch. Note that in this problem y is computed in a different manner than 
in the compressive sensing context. Specifically, in the network traffic 
example we do not ever observe x; directly, rather we observe increments to 
x; (when a packet with destination 2 passes through the router). Thus we 
construct y iteratively by adding the 7" column to y each time we observe 
an increment to z;, which we can do since y = @z is linear. When the 
network traffic is dominated by traffic to a small number of destinations, 
the vector x is compressible, and thus the problem of recovering x from the 
sketch @z is again essentially the same as the sparse recovery problem in 
compressive sensing. 


Compressive medical imaging 
This module describes the application of compressive sensing to problems 
in medical imaging. 


MR image reconstruction 


Magnetic Resonance Imaging (MRI) is a medical imaging technique based 
on the core principle that protons in water molecules in the human body 
align themselves in a magnetic field. MRI machines repeatedly pulse 
magnetic fields to cause water molecules in the human body to disorient 
and then reorient themselves, which causes a release of detectable 
radiofrequencies. We assume that the object to be imaged as a collection of 
voxels. The MRI's magnetic pulses are sent incrementally along a gradient 
leading to a different phase and frequency encoding for each column and 
row of voxels respectively. Abstracting away from the technicalities of the 
physical process, the magnetic field measured in MRI acquisition 
corresponds to a Fourier coefficient of the imaged object; the object can 
then be recovered by an inverse Fourier transform. , we can view the MRI 
as measuring Fourier samples. 


A major limitation of the MRI process is the linear relation between the 
number of measured data samples and scan times. Long-duration MRI 
scans are more susceptible to physiological motion artifacts, add discomfort 
to the patient, and are expensive [link]. Therefore, minimizing scan time 
without compromising image quality is of direct benefit to the medical 
community. 


The theory of compressive sensing (CS) can be applied to MR image 
reconstruction by exploiting the transform-domain sparsity of MR 

images [link], [link], [link], [link]. In standard MRI reconstruction, 
undersampling in the Fourier domain results in aliasing artifacts when the 
image is reconstructed. However, when a known transform renders the 
object image sparse or compressible, the image can be reconstructed using 
sparse recovery. methods. While the discrete cosine and wavelet transforms 
are commonly used in CS to reconstruct these images, the use of total 
variation norm minimization also provides high-quality reconstruction. 


Electroencephalography 


Electroencephalography (EEG) and Magnetoencephalography (MEG) are 
two popular noninvasive methods to characterize brain function by 
measuring scalp electric potential distributions and magnetic fields due to 
neuronal firing. EEG and MEG provide temporal resolution on the 
millisecond timescale characteristic of neural population activity and can 
also help to estimate the current sources inside the brain by solving an 
inverse problem [link]. 


Models for neuromagnetic sources suggest that the underlying activity is 
often limited in spatial extent. Based on this idea, algorithms like FOCUSS 
(Focal Underdetermined System Solution) are used to identify highly 
localized sources by assuming a sparse model to solve an underdetermined 
problem [link]. 


FOCUSS is a recursive linear estimation procedure, based on a weighted 
pseudo-inverse solution. The algorithm assigns a current (with nonlinear 
current location parameters) to each element within a region so that the 
unknown current values can be related linearly to the measurements. The 
weights at each step are derived from the solution of the previous iterative 
step. The algorithm converges to a source distribution in which the number 
of parameters required to describe source currents does not exceed the 
number of measurements. The initialization determines which of the 
localized solutions the algorithm converges to. 


Analog-to-information conversion 

In this module we describe the random demodulator and how it can be used 
in the application of the theory of compressive sensing to the problem of 
acquiring a high-bandwidth continuous-time signal. 


We now consider the application of compressive sensing (CS) to the 
problem of designing a system that can acquire a continuous-time signal 
x(t). Specifically, we would like to build an analog-to-digital converter 
(ADC) that avoids having to sample x(t) at its Nyquist rate when x(t) is 
sparse. In this context, we will assume that x(t) has some kind of sparse 
structure in the Fourier domain, meaning that it is still bandlimited but that 
much of the spectrum is empty. We will discuss the different possible signal 
models for mathematically capturing this structure in greater detail below. 
For now, the challenge is that our measurement system must be built using 
analog hardware. This imposes severe restrictions on the kinds of 
operations we can perform. 


Analog measurement model 


To be more concrete, since we are dealing with a continuous-time signal 
z(t), we must also consider continuous-time test functions {y; Oy: We 
then consider a finite window of time, say t € [0, T], and would like to 


collect IZ measurements of the form 
Equation: 


v= ey; (at. 


Building an analog system to collect such measurements will require three 
main components: 


1. hardware for generating the test signals y, (t); 
2. M correlators that multiply the signal x(t) with each respective ; (t); 
3. M integrators with a zero-valued initial state. 


We could then sample and quantize the output of each of the integrators to 
collect the measurements y|7]. Of course, even in this somewhat idealized 
setting, it should be clear that what we can build in hardware will constrain 
our choice of y, (t) since we cannot reliably and accurately produce (and 
reproduce) arbitrarily complex y, (t) in analog hardware. Moreover, the 
architecture described above requires M/ correlator/integrator pairs 
operating in parallel, which will be potentially prohibitively expensive both 
in dollar cost as well as costs such as size, weight, and power (SWAP). 


As a result, there have been a number of efforts to design simpler 
architectures, chiefly by carefully designing structured y; (t). The simplest 
to describe and historically earliest idea is to choose y; (t) = 6 (t — t,), 


where {t et denotes a sequence of JM locations in time at which we 


would like to sample the signal x(t). Typically, if the number of 
measurements we are acquiring is lower than the Nyquist-rate, then these 
locations cannot simply be uniformly spaced in the interval [0,7], but must 
be carefully chosen. Note that this approach simply requires a single 
traditional ADC with the ability to sample on a non-uniform grid, avoiding 
the requirement for / parallel correlator/integrator pairs. Such non- 
uniform sampling systems have been studied in other contexts outside of 
the CS framework. For example, there exist specialized fast algorithms for 
the recovery of extremely large Fourier-sparse signals. The algorithm uses 
samples at a non-uniform sequence of locations that are highly structured, 
but where the initial location is chosen using a (pseudo)random seed. This 
literature provides guarantees similar to those available from standard 

CS [link], [link]. Additionally, there exist frameworks for the sampling and 
recovery of multi-band signals, whose Fourier transforms are mostly zero 
except for a few frequency bands. These schemes again use non-uniform 
sampling patterns based on coset sampling [link], [link], [link], [link], 
[link], [link]. Unfortunately, these approaches are often highly sensitive to 
jitter, or error in the timing of when the samples are taken. 


We will consider a rather different approach, which we call the random 
demodulator [link], [link], [link].[footnote] The architecture of the random 
demodulator is depicted in [link]. The analog input x(t) is correlated with a 
pseudorandom square pulse of +1's, called the chipping sequence p, (t), 


which alternates between values at a rate of NV.,Hz, where N,,Hz is at least 
as fast as the Nyquist rate of x(t). The mixed signal is integrated over a 
time period 1/M, and sampled by a traditional integrate-and-dump back- 
end ADC at M,Hz < N,Hz. In this case our measurements are given by 
A correlator is also known as a “demodulator” due to its most common 
application: demodulating radio signals. 

Equation: 


3/Ma 
yljl = | eed: 


j-1)/Ma 


In practice, data is processed in time blocks of period 7’, and we define 

N = N,T as the number of elements in the chipping sequence, and 

M = M,T as the number of measurements. We will discuss the 
discretization of this model below, but the key observation is that the 
correlator and chipping sequence operate at a fast rate, while the back-end 
ADC operates at a low rate. In hardware it is easier to build a high-rate 
modulator/chipping sequence combination than a high-rate ADC [link]. In 
fact, many systems already use components of this front end for binary 
phase shift keying demodulation, as well as for other conventional 
communication schemes such as CDMA. 
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Random demodulator block diagram. 


Discrete formulation 


Although the random demodulator directly acquires compressive 
measurements without first sampling x(t), it is equivalent to a system 
which first samples x(t) at its Nyquist-rate to yield a discrete-time vector z, 
and then applies a matrix @ to obtain the measurements y = @z. To see this 
we let p, [n| denote the sequence of +1 used to generate the signal p, (t), 
i.e., Dc (t) = p- [n| for t € [(n — 1)/Na, n/N]. As an example, consider 
the first measurement, or the case of 7 = 1. In this case, t € [0,1/M,], so 
that p, (t) is determined by p, [n] for n = 1, 2,..., Na/Ma. Thus, from 
[link] we obtain 

Equation: 


But since N, is the Nyquist-rate of x(t), f, ss 7 in, © (t) dt simply 


calculates the average value of a(t) on the n™ interval, yielding a sample 
denoted x[n]. Thus, we obtain 
Equation: 


In general, our measurement process is equivalent to multiplying the signal 
zx with the random sequence of +1's in p, [n] and then summing every 
sequential block of N/M, coefficients. We can represent this as a banded 


matrix ® containing N/M, pseudorandom +1s per row. For example, 
with N = 12, M = 4, and T = 1, sucha @ is expressed as 
Equation: 
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In general, & will have M rows and each row will contain N/M nonzeros. 
Note that matrices satisfying this structure are extremely efficient to apply, 
requiring only O(V) computations compared to O(M N) in the general 
case. This is extremely useful during recovery. 


A detailed analysis of the random demodulator in [link] studied the 
properties of these matrices applied to a particular signal model. 
Specifically, it is shown that if W represents the NV x N normalized discrete 
Fourier transform (DFT) matrix, then the matrix @W will satisfy the 


Equation: 


M = O(K log’ (N/K)), 


where the probability is taken with respect to the random choice of p, |n]. 
This means that if x(t) is a periodic (or finite-length) signal such that once 
it is sampled it is sparse or compressible in the basis W, then it should be 
possible to recover x(t) from the measurements provided by the random 
demodulator. Moreover, it is empirically demonstrated that combining @; 
minimization with the random demodulator can recover /K-sparse (in W) 
signals with 

Equation: 


M > CK log (N/K +1) 


measurements where C’ © 1. 7 [link]. 


Note that the signal model considered in [link] is somewhat restrictive, 
since even a pure tone will not yield a sparse DFT unless the frequency 
happens to be equal to k/N, for some integer k. Perhaps a more realistic 
signal model is the multi-band signal model of [link], [link], [link], [link], 
[link], [link], where the signal is assumed to be bandlimited outside of 
bands each of bandwidth B, where K B is much less than the total possible 
bandwidth. It remains unknown whether the random demodulator can be 
exploited to recover such signals. Moreover, there also exist other CS- 
inspired architectures that we have not explored in this [link], [link], [link], 
and this remains an active area of research. We have simply provided an 
overview of one of the more promising approaches in order to illustrate the 
potential applicability of the ideas of this course to the problem of analog- 
to-digital conversion. 


Single-pixel camera 
This module describes the application of compressive sensing to the design 
of a novel imaging architecture called the "single-pixel camera". 


Architecture 


Several hardware architectures have been proposed that apply the theory of 
compressive sensing (CS) in an imaging setting [link], [link], [link]. We 
will focus on the so-called single-pixel camera [link], [link], [link], [link], 
[link]. The single-pixel camera is an optical computer that sequentially 
measures the inner products y |7] = (x, y;) between an NV-pixel sampled 
version of the incident light-field from the scene under view (denoted by x) 
and a set of N-pixel test functions {y ee The architecture is illustrated 


in [link], and an aerial view of the camera in the lab is shown in [link]. As 
shown in these figures, the light-field is focused by a lens (Lens 1 in [link]) 
not onto a CCD or CMOS sampling array but rather onto a spatial light 
modulator (SLM). An SLM modulates the intensity of a light beam 
according to a control signal. A simple example of a transmissive SLM that 
either passes or blocks parts of the beam is an overhead transparency. 
Another example is a liquid crystal display (LCD) projector. 
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Single-pixel camera block diagram. Incident light-field 
(corresponding to the desired image 2) is reflected off a digital 
micromirror device (DMD) array whose mirror orientations are 
modulated according to the pseudorandom pattern y; supplied 


by a random number generator. Each different mirror pattern 
produces a voltage at the single photodiode that corresponds to 
one measurement y[j]. 


The Texas Instruments (TI) digital micromirror device (DMD) is a 
reflective SLM that selectively redirects parts of the light beam. The DMD 
consists of an array of bacterium-sized, electrostatically actuated micro- 
mirrors, where each mirror in the array is suspended above an individual 
static random access memory (SRAM) cell. Each mirror rotates about a 
hinge and can be positioned in one of two states (+10 degrees from 
horizontal) according to which bit is loaded into the SRAM cell; thus light 
falling on the DMD can be reflected in two directions depending on the 
orientation of the mirrors. 


Each element of the SLM corresponds to a particular element of y; (and its 
corresponding pixel in x). For a given yj, we can orient the corresponding 
element of the SLM either towards (corresponding to a 1 at that element of 
(pj) or away from (corresponding to a 0 at that element of y;) a second lens 
(Lens 2 in [link]). This second lens collects the reflected light and focuses it 
onto a single photon detector (the single pixel) that integrates the product of 
z and y, to compute the measurement y [7] = (x, y;) as its output voltage. 
This voltage is then digitized by an A/D converter. Values of y; between 0 
and 1 can be obtained by dithering the mirrors back and forth during the 
photodiode integration time. By reshaping zx into a column vector and the 
(py; into row vectors, we can thus model this system as computing the 
product y = @z, where each row of & corresponds to a y;. To compute 
randomized measurements, we set the mirror orientations y,; randomly 
using a pseudorandom number generator, measure y|7], and then repeat the 
process M times to obtain the measurement vector y. 


DMD+ALP Board 
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Aerial view of the single-pixel camera in the lab. 


The single-pixel design reduces the required size, complexity, and cost of 
the photon detector array down to a single unit, which enables the use of 
exotic detectors that would be impossible in a conventional digital camera. 
Example detectors include a photomultiplier tube or an avalanche 
photodiode for low-light (photon-limited) imaging, a sandwich of several 
photodiodes sensitive to different light wavelengths for multimodal sensing, 
a spectrometer for hyperspectral imaging, and so on. 


In addition to sensing flexibility, the practical advantages of the single-pixel 
design include the facts that the quantum efficiency of a photodiode is 
higher than that of the pixel sensors in a typical CCD or CMOS array and 
that the fill factor of a DMD can reach 90% whereas that of a CCD/CMOS 
array is only about 50%. An important advantage to highlight is that each 
CS measurement receives about N/2 times more photons than an average 
pixel sensor, which significantly reduces image distortion from dark noise 
and read-out noise. 


The single-pixel design falls into the class of multiplex cameras. The 
baseline standard for multiplexing is classical raster scanning, where the 
test functions {y;} are a sequence of delta functions 6|n — J] that turn on 
each mirror in turn. There are substantial advantages to operating in a CS 
rather than raster scan mode, including fewer total measurements (JV for 
CS rather than N for raster scan) and significantly reduced dark noise. 
See [link] for a more detailed discussion of these issues. 


[link] (a) and (b) illustrates a target object (a black-and-white printout of an 
“R”) x and reconstructed image & taken by the single-pixel camera 
prototype in [link] using N = 256 x 256 and M = N/50[link]. [link](c) 
illustrates an N = 256 x 256 color single-pixel photograph of a printout of 
the Mandrill test image taken under low-light conditions using RGB color 
filters and a photomultiplier tube with M = N/10. In both cases, the 
images were reconstructed using total variation minimization, which is 
closely related to wavelet coefficient 2; minimization [link]. 


Sample image reconstructions from single-pixel camera. (a) 

256 x 256 conventional image of a black-and-white “R”. (b) Image 
reconstructed from M = 1300 single-pixel camera measurements ( 
50x sub-Nyquist). (c) 256 x 256 pixel color reconstruction of a 
printout of the Mandrill test image imaged in a low-light setting using 
a single photomultiplier tube sensor, RGB color filters, and MZ = 6500 
random measurements. 


Discrete formulation 


Since the DMD array is programmable, we can employ arbitrary test 
functions y;. However, even when we restrict the y, to be {0, 1}-valued, 
storing these patterns for large values of NV is impractical. Furthermore, as 
noted above, even pseudorandom @ can be computationally problematic 
during recovery. Thus, rather than purely random @, we can also consider 
that admit a fast transform-based implementation by taking random 
submatrices of a Walsh, Hadamard, or noiselet transform [link], [link]. We 
will describe the Walsh transform for the purpose of illustration. 


We will suppose that NV is a power of 2 and let Wiog, 7 denote the N x N 
Walsh transform matrix. We begin by setting Wo = 1, and we now define 
W;; recursively as 

Equation: 
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This construction produces an orthonormal matrix with entries of +1/ VN 
that admits a fast implementation requiring O(N log N) computations to 
apply. As an example, note that 

Equation: 


and 
Equation: 
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We can exploit these constructions as follows. Suppose that N = 2? and 
generate Wz. Let [pr denote a M x N submatrix of the identity J obtained 
by picking a random set of / rows, so that J->Wg is the submatrix of Wp 
consisting of the rows of Wg indexed by J’. Furthermore, let D denote a 
random NV x N permutation matrix. We can generate @ as 

Equation: 


i 1 
®= (Sve + 5). 


Note that i/N IlrWet+ - merely rescales and shifts [pW gz to have 

{0, 1}-valued entries, and recall that each row of @ will be reshaped into a 
2-D matrix of numbers that is then displayed on the DMD array. 
Furthermore, D can be thought of as either permuting the pixels or 
permuting the columns of Wz. This step adds some additional randomness 
since some of the rows of the Walsh matrix are highly correlated with 
coarse scale wavelet basis functions — but permuting the pixels eliminates 
this structure. Note that at this point we do not have any strict guarantees 
that such & combined with a wavelet basis W will yield a product 6W 


well in practice. 


Hyperspectral imaging 
This module provides an overview of architectures and methods for 
hyperspectral imaging using the ideas of compressive sensing. 


Standard digital color images of a scene of interest consist of three 
components — red, green and blue — which contain the intensity level for 
each of the pixels in three different groups of wavelengths. This concept has 
been extended in the hyperspectral and multispectral imaging sensing 
modalities, where the data to be acquired consists of a three-dimensional 
datacube that has two spatial dimensions z and y and one spectral 
dimension A. 


In simple terms, a datacube is a 3-D function f(a, y, A) that can be 
represented as a stacking of intensities of the scene at different wavelengths. 
An example datacube is shown in [link]. Each of its entries is called a 
voxel. We also define a pixel's spectral signature as the stacking of its 
voxels in the spectral dimension f (x, y) = {f (x, y, A) }). The spectral 
signature of a pixel can give a wealth of information about the 
corresponding point in the scene that is not captured by its color. For 
example, using spectral signatures, it is possible to identify the type of 
material observed (for example, vegetation vs. ground vs. water), or its 
chemical composition. 


Datacubes are high-dimensional, since the standard number of pixels 
present in a digitized image is multiplied by the number of spectral bands 
desired. However, considerable structure is present in the observed data. 
The spatial structure common in natural images is also observed in 
hyperspectral imaging, while each pixel's spectral signature is usually 
smooth. 


Example hyperspectral datacube, with 
labeled dimensions. 


Compressive sensing (CS) architectures for hyperspectral imaging perform 
lower-dimensional projections that multiplex in the spatial domain, the 
spectral domain, or both. Below, we detail three example architectures, as 
well as three possible models to sparsify hyperspectral datacubes. 


Compressive hyperspectral imaging architectures 


Single pixel hyperspectral camera 


The single pixel camera uses a single photodetector to record random 
projections of the light emanated from the image, with the different random 
projections being captured in sequence. A single pixel hyperspectral camera 
requires a light modulating element that is reflective across the wavelengths 
of interest, as well as a sensor that can record the desired spectral bands 
separately [link]. A block diagram is shown in [link]. 


The single sensor consists of a single spectrometer that spans the necessary 
wavelength range, which replaces the photodiode. The spectrometer records 
the intensity of the light reflected by the modulator in each wavelength. The 
same digital micromirror device (DMD) provides reflectivity for 
wavelengths from near infrared to near ultraviolet. Thus, by converting the 
datacube into a vector sorted by spectral band, the matrix that operates on 
the data to obtain the CS measurements is represented as 


Equation: 
},, O -: O 
0 ¢,, --- O 
d = 
0 0 Pz y 


This architecture performs multiplexing only in the spatial domain, i.e. 
dimensions z and y, since there is no mixing of the different spectral bands 
along the dimension A. 


Bitstream 


Reconstruction Image 


Block diagram for a single pixel hyperspectral camera. 
The photodiode is replaced by a spectrometer that 
captures the modulated light intensity for all spectral 
bands, for each of the CS measurements. 


Dual disperser coded aperture snapshot spectral imager 


The dual disperser coded aperture snapshot spectral imager (DD-CASSIJ), 
shown in [link], is an architecture that combines separate multiplexing in 
the spatial and spectral domain, which is then sensed by a wide-wavelength 
sensor/pixel array, thus flattening the spectral dimension [Link]. 


First, a dispersive element separates the different spectral bands, which still 
overlap in the spatial domain. In simple terms, this element shears the 
datacube, with each spectral slice being displaced from the previous by a 
constant amount in the same spatial dimension. The resulting datacube is 
then masked using the coded aperture, whose effect is to "punch holes" in 
the sheared datacube by blocking certain pixels of light. Subsequently, a 
second dispersive element acts on the masked, sheared datacube; however, 
this element shears in the opposite direction, effectively inverting the 
shearing of the first dispersive element. The resulting datacube is upright, 
but features "sheared" holes of datacube voxels that have been masked out. 


The resulting modified datacube is then received by a sensor array, which 
flattens the spectral dimension by measuring the sum of all the wavelengths 
received; the received light field resembles the target image, allowing for 
optical adjustments such as focusing. In this way, the measurements consist 
of full sampling in the spatial x and y dimensions, with an aggregation 
effect in the spectral A dimension. 


System Schematic Sensing Process 


Dual disperser coded aperture snapshot spectral imager (DD-CASSI). 
(a) Schematic of the DD-CASSI components. (b) Illustration of the 
datacube processing performed by the components. 


Single disperser coded aperture snapshot spectral imager 


The single disperser coded aperture snapshot spectral imager (SD-CASSIJ), 
shown in [link], is a simplification of the DD-CASSI architecture in which 
the first dispersive element is removed [link]. Thus, the light field received 
at the sensors does not resemble the target image. Furthermore, since the 
shearing is not reversed, the area occupied by the sheared datacube is larger 
than that of the original datacube, requiring a slightly larger number of 
pixels for the capture. 


System Schematic Sensing Process 
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Single disperser coded aperture snapshot spectral imager (SD-CASSI). 
(a) Schematic of the SD-CASSI components. (b) Illustration of the 
datacube processing performed by the components. 


Sparsity structures for hyperspectral datacubes 


Dyadic Multiscale Partitioning 


This sparsity structure assumes that the spectral signature for all pixels in a 
neighborhood is close to constant; that is, that the datacube is piecewise 
constant with smooth borders in the spatial dimensions. The complexity of 
an image is then given by the number of spatial dyadic squares with 
constant spectral signature necessary to accurately approximate the 
datacube; see [link]. A reconstruction algorithm then searches for the signal 
of lowest complexity (i.e., with the fewest dyadic squares) that generates 
compressive measurements close to those observed [link]. 


Spatial Dimension 2 


Spatial Dimension 1 


Example dyadic square partition for 
piecewise spatially constant datacube. 


Spatial-only sparsity 


This sparsity structure operates on each spectral band separately and 
assumes the same type of sparsity structure for each band [link]. The 
sparsity basis is drawn from those commonly used in images, such as 
wavelets, curvelets, or the discrete cosine basis. Since each basis operates 
only on a band, the resulting sparsity basis for the datacube can be 
represented as a block-diagonal matrix: 


Equation: 
Vz, O -::- O 
0 Y, -- O 
Y= 
0 0 - Wy 


Kronecker product sparsity 


This sparsity structure employs separate sparsity bases for the spatial 
dimensions and the spectral dimension, and builds a sparsity basis for the 
datacube using the Kronecker product of these two [link]: 

Equation: 


WZ — Wy ® Woy = Wy [2, 1]%r,y Wy [2, 2|%n,y 


In this manner, the datacube sparsity bases simultaneously enforces both 
Spatial and spectral structure, potentially achieving a sparsity level lower 
than the sums of the spatial sparsities for the separate spectral slices, 
depending on the level of structure between them and how well can this 
structure be captured through sparsity. 


Summary 


Compressive sensing will make the largest impact in applications with very 
large, high dimensional datasets that exhibit considerable amounts of 
structure. Hyperspectral imaging is a leading example of such applications; 
the sensor architectures and data structure models surveyed in this module 
show initial promising work in this new direction, enabling new ways of 
simultaneously sensing and compressing such data. For standard sensing 
architectures, the data structures surveyed also enable new transform 
coding-based compression schemes. 


Compressive processing of manifold-modeled data 
This module outlines the connection between compressive sensing and 
random projections of manifolds. 


A powerful data model for many applications is the geometric notion of a 
low-dimensional manifold. Data that possesses merely A intrinsic degrees 
of freedom” can be assumed to lie on a K-dimensional manifold in the 
high-dimensional ambient space. Once the manifold model is identified, 
any point on it can be represented using essentially K pieces of 
information. For instance, suppose a stationary camera of resolution NV 
observes a truck moving down along a straight line on a highway. Then, the 
set of images captured by the camera forms a 1-dimensional manifold in the 
image space R™. Another example is the set of images captured by a static 
camera observing a cube that rotates in 3 dimensions. ([link]). 
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(a) A rotating cube has 3 degrees of freedom, thus giving rise 
to a 3-dimensional manifold in image space. (b) Illustration of 
a manifold F parametrized by a K —dimensional vector 6. 


In many applications, it is beneficial to explicitly characterize the structure 
(alternately, identify the parameters) of the manifold formed by a set of 
observed signals. This is known as manifold learning and has been the 
subject of considerable study over the last several years; well-known 
manifold learning algorithms include Isomap [link], LLE [link], and 
Hessian eigenmaps [link]. An informal example is as follows: if a 2- 
dimensional manifold were to be imagined as the surface of a twisted sheet 


of rubber, manifold learning can be described as the process of “unraveling” 
the sheet and stretching it out on a 2D flat surface. [link] indicates the 
performance of Isomap on a simple 2-dimensional dataset comprising of 
images of a translating disk. 
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(a) Input data consisting of 1000 images of a disk shifted in K = 2 

dimensions, parametrized by an articulation vector (61, 02). (b) True 

9, and @5 values of the sampled data. (c) Isomap embedding learned 
from original data in RY. 


A linear, nonadaptive manifold dimensionality reduction technique has 
recently been introduced that employs the technique of random 
projections [link]. Consider a k-dimensional manifold .@ in the ambient 
space IR% and its projection onto a random subspace of dimension 

M = CK log (N); note that K < M << N. The result of [link] is that 
the pairwise metric structure of sample points from .@ is preserved with 
high accuracy under projection from R% to R™. This is analogous to the 
result for compressive sensing of sparse signals (see |The restricted 


projections required to preserve the ensemble structure does not depend on 
the sparsity of the individual images, but rather on the dimension of the 
underlying manifold. 


This result has far reaching implications; it suggests that a wide variety of 
signal processing tasks can be performed directly on the random projections 
acquired by these devices, thus saving valuable sensing, storage and 


processing costs. In particular, this enables provably efficient manifold 
learning in the projected domain [link]. [link] illustrates the performance of 
Isomap on the translating disk dataset under varying numbers of random 
projections. 
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Isomap embeddings learned from random projections of the 625 
images of shifting squares. (a) 25 random projections; (b) 50 random 
projections; (c) 25 random projections; (d) full data. 


The advantages of random projections extend even to cases where the 
original data is available in the ambient space R™. For example, consider a 
wireless network of cameras observing a static scene. The set of images 
captured by the cameras can be visualized as living on a low-dimensional 
manifold in the image space. To perform joint image analysis, the following 
steps might be executed: 


1. Collate: Each camera node transmits its respective captured image (of 
size N) to a central processing unit. 

2. Preprocess: The central processor estimates the intrinsic dimensionk 
of the underlying image manifold. 

3. Learn: The central processor performs a nonlinear embedding of the 
data points — for instance, using Isomap [link] — into a A’-dimensional 
Euclidean space, using the estimate of K from the previous step. 


In situations where NV is large and communication bandwidth is limited, the 
dominating costs will be in the first transmission/collation step. To reduce 
the communication expense, one may perform nonlinear image 
compression (such as JPEG) at each node before transmitting to the central 
processing. However, this requires a good deal of processing power at each 


sensor, and the compression would have to be undone during the learning 
step, thus adding to overall computational costs. 


As an alternative, every camera could encode its image by computing 
(either directly or indirectly) a small number of random projections to 
communicate to the central processor [link]. These random projections are 
obtained by linear operations on the data, and thus are cheaply computed. 
Clearly, in many situations it will be less expensive to store, transmit, and 
process such randomly projected versions of the sensed images. The 
method of random projections is thus a powerful tool for ensuring the stable 
embedding of low-dimensional manifolds into an intermediate space of 
reasonable size. It is now possible to think of settings involving a huge 
number of low-power devices that inexpensively capture, store, and 
transmit a very small number of measurements of high-dimensional data. 


Inference using compressive measurements 

This module provides an introduction to some simple algorithms for 
compressive signal processing, i.e., processing compressive measurements 
directly without first recovering the signal to solve an inference problem. 


While the compressive sensing (CS) literature has focused almost 
exclusively on problems in signal reconstruction/approximation, this is 
frequently not necessary. For instance, in many signal processing 
applications (including computer vision, digital communications and radar 
systems), signals are acquired only for the purpose of making a detection or 
classification decision. Tasks such as detection do not require a 
reconstruction of the signal, but only require estimates of the relevant 
sufficient statistics for the problem at hand. 


As a simple example, suppose a surveillance system (based on compressive 
imaging) observes the motion of a person across a static background. The 
relevant information to be extracted from the data acquired by this system 
would be, for example, the identity of the person, or the location of this 
person with respect to a predefined frame of coordinates. There are two 
ways of doing this: 


e Reconstruct the full data using standard sparse recovery techniques 
and apply standard computer vision/inference algorithms on the 
reconstructed images. 

e Develop an inference test which operates directly on the compressive 
measurements, without ever reconstructing the full images. 


A crucial property that enables the design of compressive inference 
algorithms is the information scalability property of compressive 
measurements. This property arises from the following two observations: 


e For certain signal models, the action of a random linear function on the 
set of signals of interest preserves enough information to perform 
inference tasks on the observed measurements. 

e The number of random measurements required to perform the 
inference task usually depends on the nature of the inference task. 
Informally, we observe that more sophisticated tasks require more 
measurements. 


We examine three possible inference problems for which algorithms that 
directly operate on the compressive measurements can be developed: 
detection (determining the presence or absence of an information-bearing 
signal), classification (assigning the observed signal to one of two (or more) 
signal classes), and parameter estimation (calculating a function of the 
observed signal). 


Detection 


In detection one simply wishes to answer the question: is a (known) signal 
present in the observations? To solve this problem, it suffices to estimate a 
relevant sufficient statistic. Based on a concentration of measure inequality, 
it is possible to show that such sufficient statistics for a detection problem 
can be accurately estimated from random projections, where the quality of 
this estimate depends on the signal to noise ratio (SNR) [link]. We make no 
assumptions on the signal of interest s, and hence we can build systems 
capable of detecting s even when it is not known in advance. Thus, we can 
use random projections for dimensionality-reduction in the detection setting 
without knowing the relevant structure. 


In the case where the class of signals of interest corresponds to a low 
dimensional subspace, a truncated, simplified sparse approximation can be 
applied as a detection algorithm; this has been dubbed as IDEA [link]. In 
simple terms, the algorithm will mark a detection when a large enough 
amount of energy from the measurements lies in the projected subspace. 
Since this problem does not require accurate estimation of the signal values, 
but rather whether it belongs in the subspace of interest or not, the number 
of measurements necessary is much smaller than that required for 
reconstruction, as shown in [link]. 
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Performance for IDEA. (Top) Sample wideband chirp signal and same 
chirp embedded in strong narrowband interference. (Bottom) 
Probability of error to reconstruct and detect chirp signals embedded 
in strong sinusoidal interference (SIR = —6 dB) using greedy 
algorithms. In this case, detection requires 3x fewer measurements 
and 4x fewer computations than reconstruction for an equivalent 
probability of success. Taken from [link]. 


Classification 


Similarly, random projections have long been used for a variety of 
classification and clustering problems. The Johnson-Lindenstrauss Lemma 
is often exploited in this setting to compute approximate nearest neighbors, 
which is naturally related to classification. The key result that random 
projections result in an isometric embedding allows us to generalize this 
work to several new classification algorithms and settings [link]. 


Classification can also be performed when more elaborate models are used 
for the different classes. Suppose the signal/image class of interest can be 
modeled as a low-dimensional manifold in the ambient space. In such case 
it can be shown that, even under random projections, certain geometric 
properties of the signal class are preserved up to a small distortion; for 
example, interpoint Euclidean (£2) distances are preserved [link]. This 
enables the design of classification algorithms in the projected domain. One 


such algorithm is known as the smashed filter [link]. As an example, under 
equal distribution among classes and a gaussian noise setting, the smashed 
filter is equivalent to building a nearest-neighbor (NN) classifier in the 
measurement domain. Further, it has been shown that for a kK —dimensional 
manifold, M = O(K log N) measurements are sufficient to perform 
reliable compressive classification. Thus, the number of measurements 
scales as the dimension of the signal class, as opposed to the sparsity of the 
individual signal. Some example results are shown in [link](a). 
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Results for smashed filter image classification and parameter 
estimation experiments. (a) Classification rates and (b) average 
estimation error for varying number of measurements M and noise 
levels o for a set of images of several objects under varying shifts. As 
M increases, the distances between the manifolds increase as well, 
thus increasing the noise tolerance and enabling more accurate 
estimation and classification. Thus, the classification and estimation 
performances improve as o decreases and M increases in all cases. 
Taken from [link]. 


Estimation 


Consider asignal x € , and suppose that we wish to estimate some 
function f(a) but only observe the measurements y = zx, where @ is 
again an M x N matrix. The data streaming community has previously 
analyzed this problem for many common functions, such as linear 
functions, £, norms, and histograms. These estimates are often based on so- 
called sketches, which can be thought of as random projections. 


As an example, in the case where f is a linear function, one can show that 
the estimation error (relative to the norms of x and f) can be bounded by a 
constant determined by M. This result holds for a wide class of random 
matrices, and can be viewed as a straightforward consequence of the same 
concentration of measure inequality that has proven useful for CS and in 
proving the JL Lemma [link]. 


Parameter estimation can also be performed when the signal class is 
modeled as a low-dimensional manifold. Suppose an observed signal x can 
be parameterized by a K —dimensional parameter vector 0, where K < N. 
Then, it can be shown that with 0(K log NV’) measurements, the parameter 
vector can be obtained via multiscale manifold navigation in the 
compressed domain [link]. Some example results are shown in [link](b). 


Compressive sensor networks 
This module provides an overview of applications of compressive sensing 
in the context of distributed sensor networks. 


Sparse and compressible signals are present in many sensor network 
applications, such as environmental monitoring, signal field recording and 
vehicle surveillance. Compressive sensing (CS) has many properties that 
make it attractive in this settings, such as its low complexity sensing and 
compression, its universality and its graceful degradation. CS is robust to 
noise, and allows querying more nodes to obey further detail on signals as 
they become interesting. Packet drops also do not harm the network nearly 
as much as many other protocols, only providing a marginal loss for each 
measurement not obtained by the receiver. As the network becomes more 
congested, data can be scaled back smoothly. 


Thus CS can enable the design of generic compressive sensors that perform 
random or incoherent projections. 


Several methods for using CS in sensor networks have been proposed. 
Decentralized methods pass data throughout the network, from neighbor to 
neighbor, and allow the decoder to probe any subset of nodes. In contrast, 
centralized methods require all information to be transmitted to a 
centralized data center, but reduce either the amount of information that 
must be transmitted or the power required to do so. We briefly summarize 
each class below. 


Decentralized algorithms 


Decentralized algorithms enable the calculation of compressive 
measurements at each sensor in the network, thus being useful for 
applications where monitoring agents traverse the network during 
operation. 


Randomized gossiping 


In randomized gossiping [link], each sensor communicates MM random 
projection of its data sample to a random set of nodes, in each stage 
aggregating and forwarding the observations received to a new set of 
random nodes. In essence, a spatial dot product is being performed as each 
node collects and aggregates information, compiling a sum of the weighted 
samples to obtain MM CS measurements which becomes more accurate as 
more rounds of random gossiping occur. To recover the data, a basis that 
provides data sparsity (or at least compressibility) is required, as well as the 
random projections used. However, this information does not need to be 
known while the data is being passed. 


The method can also be applied when each sensor observes a compressible 
signal. In this case, each sensor computes multiple random projections of 
the data and transmits them using randomized gossiping to the rest of the 
network. A potential drawback of this technique is the amount of storage 
required per sensor, as it could be considerable for large networks . In this 
case, each sensor can store the data from only a subset of the sensors, where 
each group of sensors of a certain size will be known to contain CS 
measurements for all the data in the network. To maintain a constant error 
as the network size grows, the number of transmissions becomes O(kM n ) 
, where k represents the number of groups in which the data is partitioned, 
M is the number of values desired from each sensor and n are the number 
of nodes in the network. The results can be improved by using geographic 
gossiping algorithms [link]. 


Distributed sparse random projections 


A second method modifies the randomized gossiping approach by limiting 
the number of communications each node must perform, in order to reduce 
overall power consumption [link]. Each data node takes / projections of 
its data, passing along information to a small set of L neighbors, and 
summing the observations; the resulting CS measurements are sparse, since 
N  Lof each row's entries will be zero. Nonetheless, these projections 
can still be used as CS measurements with quality similar to that of full 
random projections. Since the CS measurement matrix formed by the data 
nodes is sparse, a relatively small amount of communication is performed 


by each encoding node and the overall power required for transmission is 
reduced. 


Centralized algorithms 


Decentralized algorithms are used when the sensed data must be routed to a 
single location; this architecture is common in sensor networks were low 
power, simple nodes perform sensing and a powerful central location 
performs data processing. 


Compressive wireless sensing 


Compressive wireless sensing (CWS) emphasizes the use of synchronous 
communication to reduce the transmission power of each sensor [link]. In 
CWS, each sensor calculates a noisy projection of their data sample. Each 
sensor then transmits the calculated value by analog modulation and 
transmission of a communication waveform. The projections are aggregated 
at the central location by the receiving antenna, with further noise being 
added. In this way, the fusion center receives the CS measurements, from 
which it can perform reconstruction using knowledge of the random 
projections. 


A drawback of this method is the required accurate synchronization. 
Although CWS is constraining the power of each node, it is also relying on 
constructive interference to increase the power received by the data center. 
The nodes themselves must be accurately synchronized to know when to 
transmit their data. In addition, CWS assumes that the nodes are all at 
approximately equal distances from the fusion center, an assumption that is 
acceptable only when the receiver is far away from the sensor network. 
Mobile nodes could also increase the complexity of the transmission 
protocols. Interference or path issues also would have a large effect on 
CWS, limiting its applicability. 


If these limitations are addressed for a suitable application, CWS does offer 
great power benefits when very little is known about the data beyond 


sparsity in a fixed basis. Distortion will be proportionalto MM % “% , 


where qa is some positive constant based on the network structure. With 
much more a priori information about the sensed data, other methods will 
achieve distortions proportional to M°. 


Distributed compressive sensing 


Distributed Compressive Sensing (DCS) provides several models for 
combining neighboring sparse signals, relying on the fact that such sparse 
signals may be similar to each other, a concept that is termed joint 

sparsity [link]. In an example model, each signal has a common component 
and a local innovation, with the commonality only needing to be encoded 
once while each innovation can be encoded at a lower measurement rate. 
Three different joint sparsity models (JSMs) have been developed: 


1. Both common signal and innovations are sparse; 
2. Sparse innovations with shared sparsity structure; 
3. Sparse innovations and dense common signal. 


Although JSM 1 would seem preferable due to the relatively limited amount 
of data, only JSM 2 is computationally feasible for large sensor networks; it 
has been used in many applications [link]. JSMs 1 and 3 can be solved 
using a linear program, which has cubic complexity on the number of 
sensors in the network. 


DCS, however, does not address the communication or networking 
necessary to transmit the measurements to a central location; it relies on 
standard communication and networking techniques for measurement 
transmission, which can be tailored to the specific network topology. 


Genomic sensing 
This module describes the application of compressive sensing to the design 
of new kinds of DNA microarray probes. 


Biosensing of pathogens is a research area of high consequence. An 
accurate and rapid biosensing paradigm has the potential to impact several 
fields, including healthcare, defense and environmental monitoring. In this 
module we address the concept of biosensing based on compressive sensing 
(CS) via the Compressive Sensing Microarray (CSM), a DNA microarray 
adapted to take CS-style measurements. 


DNA microarrays are a frequently applied solution for microbe sensing; 
they have a significant edge over competitors due to their ability to sense 
many organisms in parallel [link], [link]. A DNA microarray consists of 
genetic sensors or spots, each containing DNA sequences termed probes. 
From the perspective of a microarray, each DNA sequence can be viewed as 
a sequence of four DNA bases {A, 7’, G, C} that tend to bind with one 
another in complementary pairs: A with T and G with C’. Therefore, a 
DNA subsequence in a target organism's genetic sample will tend to bind or 
“hybridize” with its complementary subsequence on a microarray to form a 
stable structure. The target DNA sample to be identified is fluorescently 
tagged before it is flushed over the microarray. The extraneous DNA is 
washed away so that only the bound DNA is left on the array. The array is 
then scanned using laser light of a wavelength designed to trigger 
fluorescence in the spots where binding has occurred. A specific pattern of 
array spots will fluoresce, which is then used to infer the genetic makeup in 
the test sample. 
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There are three issues with the traditional microarray design. Each spot 
consists of probes that can uniquely identify only one target of interest 
(each spot contains multiple copies of a probe for robustness.) The first 
concern with this design is that very often the targets in a test sample have 
similar base sequences, causing them to hybridize with the wrong probe 
(see [link]). These cross-hybridization events lead to errors in the array 
readout. Current microarray design methods do not address cross-matches 
between similar DNA sequences. 


The second concern in choosing unique identifier based DNA probes is its 
restriction on the number of organisms that can be identified. In typical 
biosensing applications multiple organisms must be identified; therefore a 
large number of DNA targets requires a microarray with a large number of 
spots. In fact, there are over 1000 known harmful microbes, many with 
more than 100 strains. The implementation cost and processing speed of 
microarray data is directly related to its number of spots, representing a 
significant problem for commercial deployment of microarray-based 
biosensors. As a consequence readout systems for traditional DNA arrays 
cannot be miniaturized or implemented using electronic components and 
require complicated fluorescent tagging. 


The third concern is the inefficient utilization of the large number of array 
spots in traditional microarrays. Although the number of potential agents in 
a sample is very large, not all agents are expected to be present in a 
significant concentration at a given time and location, or in an air/water/soil 
sample to be tested. Therefore, in a traditionally designed microarray only a 
small fraction of spots will be active at a given time, corresponding to the 
few targets present. 


To combat these problems, a Compressive Sensing DNA Microarray (CSM) 
uses “combinatorial testing sensors” in order to reduce the number of sensor 
spots [link], [link], [link]. Each spot in the CSM identifies a group of target 


organisms, and several spots together generate a unique pattern identifier 
for a single target. (See also "Group testing and data stream algorithms".) 
Designing the probes that perform this combinatorial sensing is the essence 
of the microarray design process, and what we aim to describe in this 
module. 


To obtain a CS-type measurement scheme, we can choose each probe in a 
CSM to be a group identifier such that the readout of each probe is a 
probabilistic combination of all the targets in its group. The probabilities are 
representative of each probe's hybridization affinity (or stickiness) to those 
targets in its group; the targets that are not in its group have low affinity to 
the probe. The readout signal at each spot of the microarray is a linear 
combination of hybridization affinities between its probe sequence and each 
of the target agents. 
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[link] illustrates the sensing process. To formalize, we assume there are 
spots on the CSM and NN targets; we have far fewer spots than target agents. 
For 1 <i < Mandl <j7< UN, the probe at spot z hybridizes with target 7 
with affinity py; ;. The target 7 occurs in the tested DNA sample with 
concentration x, so that the total hybridization of spot 7 is 


N 
n=), i—1 Pi,jtj = Pix, where ~; and x are a row and column vector, 


respectively. The resulting measured microarray signal intensity vector 
y = {y:}; =1,..., M fits the CS measurement model y = &z. 


While group testing has previously been proposed for microarrays [link], 
the sparsity in the target signal is key in applying CS. The chief advantage 
of a CS-based approach over regular group testing is in its information 
scalability. We are able to not just detect, but estimate the target signal with 
a reduced number of measurements similar to that of group testing [link]. 
This is important since there are always minute quantities of certain 
pathogens in the environment, but it is only their large concentrations that 
may be harmful to us. Furthermore, we are able to use CS recovery methods 
such as Belief Propagation that decode x while accounting for experimental 
noise and measurement nonlinearities due to excessive target 

molecules [link]. 


£_1 minimization proof 
In this module we prove one of the core lemmas that is used throughout this 
course to establish results regarding €_1 minimization. 


We now establish one of the core lemmas that we will use throughout this 
course. Specifically, [link] is used in establishing the relationship between 
the RIP and the NSP as well as establishing results on £;_minimization in 

the context of sparse recovery in both the noise-free and noisy settings. In 
order to establish [link], we establish the following preliminary lemmas. 


Suppose wu, v are orthogonal vectors. Then 
Equation: 


ella + llella < V2llu + all. 


We begin by defining the 2 x 1 vector w = [|lu||,, ||v||,]’. By applying 
standard bounds on £, norms (Lemma 1 from !"The RIP and the NSP") with 
K = 2, we have ||w||, < V2||w||,. From this we obtain 


Equation: 
5 S 2 
lullz + [lolly S V2 lell3 + [loll 


ull> + |vl|5 = |lu + vl|3, which yields the 


Since u and v are orthogonal, 
desired result. 


any pair of vectors u,v € 2x with disjoint support, 
Equation: 


(Pu, Bv)| < d2x||ullIloll- 


Suppose u,v € Xx with disjoint support and that ||w||, = ||v||, = 1. Then, 
utv€ Zox and ||u + v||5 = 2. Using the RIP we have 
Equation: 


2(1— dex) < ||Gu + So||5 < 2(1 + dox). 


Finally, applying the parallelogram identity 
Equation: 


a) 
(Su, Sv)| < = ||Su + Su|)} — Su — Su|3| < dex 


establishes the lemma. 


Let Ao be an arbitrary subset of {1, 2,...,. NW} such that |Ao|< K. For any 
vector u € R%, define A; as the index set corresponding to the K largest 
entries of we (in absolute value), A» as the index set corresponding to the 


next K largest entries, and so on. Then 
Equation: 


U Ac 
S- |luayll < Hevasthy 
JK 


jee 


We begin by observing that for 7 > 2, 
Equation: 

ua, sll, 
lua, | o> KB 


since the A; sort u to have decreasing magnitude. Applying standard 
bounds on £, norms (Lemma 1 from |The RIP and the NSP") we have 
Equation: 


YS balls < VEY Wea Se Mah, = a, 


q>2 je2 K 531 


proving the lemma. 


We are now in a position to prove our main result. The key ideas in this 
proof follow from [link]. 


Suppose that @ satisfies the RIP of order 2K. Let Ao be an arbitrary subset 
of {1, 2, ..., N} such that |Ag|< K, and let h € R” be given. Define A, as 
the index set corresponding to the K entries of h 4< with largest magnitude, 
and set A = Ap U Ay. Then 


Equation: 
||.s || Shy, Ph 
ieee ep 
VK Palle 
where 
Equation: 
— V 260% _ 1 
1 box’ ~~ 1 box! 


Since hy € 29x, the lower bound on the RIP immediately yields 


Equation: 


(1 — dox)||Ralls < |Phall3- 


Define A, as in [link], then since $hy = 6h — SY j>2 Pha,, we can rewrite 


[link] as 
Equation: 


(1 — 5ox)||halls < (Phy, Bh) — (thoes) 


j22 


In order to bound the second term of [link], we use [link], which implies 
that 
Equation: 


(Pha, Pha,)| < daxllra, 


gllha;ll,> 


for any 4, j. Furthermore, [link] yields ||h.4, ||, + |[h.4, lly < V2l|Palle- 
Substituting into [link] we obtain 


Equation: 
(co) =|S°(Ghj,,Pha,) + ¥(Gha,, Pha,) 
j22 j22 j22 


<)|(@hs,,Ph4;)| + > | |(Pha, Pha,)| 
j>2 j22 


< box lhaglly >— WlRa;ll, + FaxllAay lle d > lajll, 


j22 j22 


< V260xK||Rally ¥— WWRaylly- 
j>2 


From [link], this reduces to 
Equation: 


acl, 
VK 


< V2650x||hAlly 


(hs. Saha) 


j22 


Combining [link] with [link] we obtain 
Equation: 


(1 —dox)|[hallS < (ash) — (haha 


je2 
< |(Gh4, Gh)| + (hs e4,) 
je2 
< hac ll, 


\(Gh4, Ph) | + Vex ||hal|_,p——, 
ATG 


which yields the desired result upon rearranging. 


Sub-Gaussian random variables 

In this module we introduce the sub-Gaussian and strictly sub-Gaussian 
distributions. We provide some simple examples and illustrate some of the 
key properties of sub-Gaussian random variables. 


A number of distributions, notably Gaussian and Bernoulli, are known to 
satisfy certain concentration of measure inequalities. We will analyze this 
phenomenon from a more general perspective by considering the class of 
sub-Gaussian distributions [link]. 


A random variable X is called sub-Gaussian if there exists a constant 
c > 0 such that 
Equation: 


E(exp (Xt)) <exp (c?t?/2) 


holds for all ¢ € R. We use the notation X ~ Sub (c’) to denote that 
X satisfies [link]. 


The function E(exp (Xt)) is the moment-generating function of X, while 
the upper bound in [link] is the moment-generating function of a Gaussian 
random variable. Thus, a sub-Gaussian distribution is one whose moment- 
generating function is bounded by that of a Gaussian. There are a 
tremendous number of sub-Gaussian distributions, but there are two 
particularly important examples: 


Example: 

Xe ey (0, oh, i.e., X is a zero-mean Gaussian random variable with 
variance o”, then X ~ Sub (cake Indeed, as mentioned above, the 
moment-generating function of a Gaussian is given by 

E(exp (Xt)) =exp (o7t”/2), and thus [link] is trivially satisfied. 


Example: 
If X is a zero-mean, bounded random variable, i.e., one for which there 
exists a constant B such that |X| < B with probability 1, then 


X ~ Sub (B’) : 


A common way to characterize sub-Gaussian random variables is through 
analyzing their moments. We consider only the mean and variance in the 
following elementary lemma, proven in [link]. 

(Buldygin-Kozachenko [link ]) 


If X ~ Sub (ce) then, 


Equation: 
E(x) =0 
and 
Equation: 
E (X ) < ¢?. 


[link] shows that if X ~ Sub (c’) then E eo ) < c?. Insome settings it 
will be useful to consider a more restrictive class of random variables for 
which this inequality becomes an equality. 


A random variable X is called strictly sub-Gaussian if X ~ Sub (o”) 
where o? = E (Xx IP i.e., the inequality 
Equation: 


E(exp (Xt)) <exp (o°t? /2) 


holds for all ¢ € IR. To denote that X is strictly sub-Gaussian with 
variance co”, we will use the notation X ~ SSub (7). 


Example: 
If X ~ (0,07), then X ~ SSub(o’). 


Example: 
If X ~ U(—1,1), ie., X is uniformly distributed on the interval [—1, 1], 
then X ~ SSub(1/3). 


Example: 
Now consider the random variable with distribution such that 
Equation: 

a 1l—s 


P(X=1)=P(X=-1)=—.—, P(X=0)=s, s€[(0,1). 


For any s € [0, 2/3], X ~ SSub(1 — s). For s € (2/3, 1), X is not 
strictly sub-Gaussian. 


We now provide an equivalent characterization for sub-Gaussian and 
strictly sub-Gaussian random variables, proven in [link], that illustrates 
their concentration of measure behavior. 

(Buldygin-Kozachenko [link ]) 


A random variable X ~ Sub (c?) if and only if there exists ato > 0 anda 
constant a > 0 such that 
Equation: 

t2 


P (|X| > 4) <2exp (-s) 


for all t > to. Moreover, if X ~ SSub (o’), then [link] holds for all t > 0 
witha = oa. 


Finally, sub-Gaussian distributions also satisfy one of the fundamental 
properties of a Gaussian distribution: the sum of two sub-Gaussian random 
variables is itself a sub-Gaussian random variable. This result is established 
in more generality in the following lemma. 


Suppose that X = [X 1, Xo,..., Xw], where each X; is independent and 
identically distributed (i.i.d.) with X; ~ Sub (ce), Then for any a € RY, 


(X, a) ~ Sub (<*| Qa 2). Similarly, if each X; ~ SSub (o”), then for 
any a € RX, (X,a) ~ SSub(0?| a 2). 


Since the X; are i.i.d., the joint distribution factors and simplifies as: 
Equation: 


E (ox (: 3 ox =E (i exp (0X) 


E(exp (ta;X;)) 
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If the X; are strictly sub-Gaussian, then the result follows by setting 


c? = o” and observing that E((X,a)”) =o" || a ||3. 


Concentration of measure for sub-Gaussian random variables 
This module establishes concentration bounds for sub-Gaussian vectors and 
matrices. 


Sub-Gaussian distributions have a close relationship to the concentration of 
measure phenomenon [link]. To illustrate this, we note that we can combine 
Lemma 2 and Theorem 1 from "Sub-Gaussian random variables" to obtain 
deviation bounds for weighted sums of sub-Gaussian random variables. For our 
purposes, however, it will be more enlightening to study the norm of a vector of 
sub-Gaussian random variables. In particular, if X is a vector where each X; is 
iid. with X; ~ Sub (c), then we would like to know how || X ||, deviates from 
its mean. 


In order to establish the result, we will make use of Markov's inequality for 
nonnegative random variables. 


(Markov's Inequality) 
For any nonnegative random variable X and t > 0, 
Equation: 
E(X 
P(X >t)< : ) : 


Let f(x) denote the probability density function for X. 
Equation: 


B(x)= f° af(@jae> [ef (ede > [tf (@)de= (x= 0) 


In addition, we will require the following bound on the exponential moment of a 
sub-Gaussian random variable. 


Suppose X ~ Sub (c’). Then 
Equation: 
1 


E(exp (AX“/2c*)) < a 


for any A € (0, 1). 


First, observe that if A = 0, then the lemma holds trivially. Thus, suppose that 
A € (0,1). Let f(a) denote the probability density function for X. Since X is 
sub-Gaussian, we have by definition that 

Equation: 


[- exp (tx) f (a) dx <exp (c’t?/2) 


CO 


for any t € R. If we multiply by exp (—c7t?/2A), then we obtain 
Equation: 


[- exp (tx — c*t*/2) f (a) da <exp (c*t? (A — 1)/2A). 


(2) 


Now, integrating both sides with respect to t, we obtain 


Equation: 
[- ([- exp (tx — c*t?/2X) it) (x) dx < i exp (c7t? (A — 1)/2) dt, 


which reduces to 
Equation: 


= 1 2 
evan | exp (Ax?/2c’) f (x) dz < iJ i on 
c = c — 


This simplifies to prove the lemma. 


We now state our main theorem, which generalizes the results of [link] and uses 
substantially the same proof technique. 


Suppose that X = |X 1, X2,..., Xj], where each X; is i.i.d. with X; ~ Sub (c’) 
and E ) = o”. Then 
Equation: 


E(|| X ||3) = Mo’. 


Moreover, for any a € (0,1) and for any 8 € [c?/o7, Bmax|, there exists a 
constant kK > 4 depending only on Bmax and the ratio o” / c? such that 
Equation: 


P(| X |S < Mo?) <exp (M1 — a)?/«") 


and 
Equation: 


P(|| X || > BMo?) <exp (—M(6 -1)?/x"). 


Since the X; are independent, we obtain 
Equation: 


B(|| X |[3) = y E(x?) = oe? = Mo" 


and hence [link] holds. We now turn to [link] and [link]. Let us first consider 
[link]. We begin by applying Markov's inequality: 
Equation: 


P(| wi, BMo”) = P(exp (A || X ||3) Sexp (48M?) ) 
exp (A8BMo?) 


2 []/,E(exp (AX?)) 
exp (ABMo?) 


Since X; ~ Sub (c”), we have from [link] that 
Equation: 


E(exp (AX;)) = E(exp (2c°AX; /2c*)) < Fomrnee 


Thus, 
Equation: 
1 M/2 
E \X?)) < 
TIE(exp (AX;)) < (<a) 
and hence 
Equation: 


exp (—2A807) me 
1 — 2c? 


P(|| X ||; > Mo?) < ( 


By setting the derivative to zero and solving for A, one can show that the optimal 
A is 
Equation: 


7 Bo? _ C2 
~ 2c2o2 (1 +B)” 


Plugging this in we obtain 
Equation: 


; 52 Pe: M/2 
P(| Dales BMo”) < (6% exp (1 - 87) 


Similarly, 
Equation: 


: 2 ic 
P(| X ||\5 < Mo?) < (0% exp (1 = a7) 
c c 


In order to combine and simplify these inequalities, note that if we define 
Equation: 
D) 2 
* maxd /C — 1 
kK =max | 4,2 — Bmaxo"/e= 1 
(Bmaxo? /c = 1)— log (Bmaxo” /c) 


then we have that for any y € [0, Bao) c| we have the bound 
Equation: 


log (9) < (y¥-) - “1, 


and hence 
Equation: 


7 <exp (o-9 = m0 | 


By setting y = ao*/c?, [link] reduces to yield [link]. Similarly, setting 
7 = Bo? /c? establishes [link]. 


This result tells us that given a vector with entries drawn according to a sub- 
Gaussian distribution, we can expect the norm of the vector to concentrate around 
its expected value of Mo? with exponentially high probability as M grows. Note, 
however, that the range of allowable choices for @ in [link] is limited to 

| ee C2 / a? > 1. Thus, for a general sub-Gaussian distribution, we may be unable 
to achieve an arbitrarily tight concentration. However, recall that for strictly sub- 
Gaussian distributions we have that c? = o?, in which there is no such restriction. 
Moreover, for strictly sub-Gaussian distributions we also have the following 
useful corollary.[footnote | 

[link] exploits the strictness in the strictly sub-Gaussian distribution twice — first 
to ensure that 8 € (1, 2] is an admissible range for 6 and then to simplify the 
computation of « . One could easily establish a different version of this corollary 
for non-strictly sub-Gaussian vectors but for which we consider a more restricted 
range of € provided that c?/a? < 2. However, since most of the distributions of 
interest in this thesis are indeed strictly sub-Gaussian, we do not pursue this route. 


Note also that if one is interested in very small e, then there is considerable room 
for improvement in the constant C’. 


Suppose that X = |X 1, Xo,..., Xz], where each X; is i.i.d. with 
X; ~ SSub (07). Then 


Equation: 
2 
B(|| X ||3) = Mo? 


and for any € > 0, 
Equation: 


* 


Me? 
P(|I X |} - Mo?| > «Mo”) < 2exp (- : ) 
K 


with x = 2/(1— log (2)) + 6.52. 


Since each X; ~ SSub (07), we have that X; ~ Sub (co?) and E (X?) = o7, in 
which case we may apply [link] with a = 1 — € and 8 = 1+ e. This allows us to 
simplify and combine the bounds in [link] and [link] to obtain [link]. The value of 
k. follows from the observation that 1 + € < 2so that we can set Bmax = 2. 


Finally, from [link] we also have the following additional useful corollary. This 
result generalizes the main results of [link] to the broader family of general 
strictly sub-Gaussian distributions via a much simpler proof. 


Suppose that @ is an M x N matrix whose entries ¢;; are i.i.d. with 

yi; ~ SSub (1/M). Let Y = @z for z € RY. Then for any € > 0, and any 
zeER, 

Equation: 


E(I¥ 13) =le ll 


and 
Equation: 


Me? 
2 2 2 
P(\I¥ 13 -ll 13] > le 18) < 20 (2S) 


with K = 2/(1— log (2)) © 6.52. 

Let y; denote the i*® row of &. Observe that if Y; denotes the first element of Y, 
then Y; = (y;, x), and thus by Lemma 2 from "Sub-Gaussian random variables", 
Y; ~ SSub (| a ||5/M ) . Applying [link] to the M-dimensional random vector 
Y, we obtain [link]. 


